## Advanced PySpark SQL operations with Python
This notebook covers advanced PySpark, SQL, and Python features. Topics this notebook covers are:

1. Functional-style programming in Python. Though Python is not a functional language, it can be used in a functional style. This style is prevalent in data science and engineering. Also, a good understanding of the map-reduce manner of programming will help understand the RDD, the core PySpark data-structure the DataFrame is built upon.
2. Blending SQL with PySpark Python methods. There are a few PySpark Python functions that accept SQL expressions.
3. Caching in PySpark, or how to improve the speed at which we access frequently requested data.
4. Window functions, also known as analytical functions, which use the values of one or more rows to return a value for each row.

In [None]:
# cell for imports
from functools import lru_cache, reduce
from itertools import takewhile

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession, Window, WindowSpec
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.utils import AnalysisException

In [None]:
spark: SparkSession = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("error")

#### Revisiting the Backblaze data using Functional Python
Unlike Scala, the language Spark is written in, Python is not a functional programming language. However it has two excellent functional modules: [functools](https://docs.python.org/3/library/functools.html) and [itertools](https://docs.python.org/3/library/itertools.html) which contain functions inspired by functional languages like Haskell and the ML family of programming languages. These functions, like reduce or accumulate, used in conjunction with lambda expressions, are very common in Python data engineering and science because their use is very succinct.

Suppose you have a particular problem: you want to take elements from a collection until some condition is met. For that you could look at `takewhile` in the itertools library.

In [None]:
list(takewhile(lambda x: x < 4, [1, 2, 4, 3, 1]))

Of course, this is a rather meaningless example. A more useful one would involve a stream of data that you keep reading until a stop signal arrives.
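A minimal sketch of such a stream, assuming a made-up generator `sensor_stream` and a made-up `STOP` marker (neither comes from the Backblaze data): `takewhile` keeps consuming values until the marker shows up and ignores everything after it.

```python
from itertools import takewhile
from typing import Iterator

STOP = "STOP"  # hypothetical stop signal, chosen for this illustration


def sensor_stream() -> Iterator[str]:
    """Stand-in for a live data stream; the readings are made up."""
    yield from ["21.5", "21.7", "22.0", STOP, "99.9"]


# Consume the stream until the stop signal; "99.9" is never read.
readings = [float(value) for value in takewhile(lambda v: v != STOP, sensor_stream())]
print(readings)  # [21.5, 21.7, 22.0]
```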

A perfect example of functional-style programming is using reduce and lambda expressions to redesign the steps we took to make the Backblaze data usable; see the PySparkSQL notebook for the original steps.

In [None]:
DATA_FILES: list[str] = [
    "drive_stats_Q1",
    "drive_stats_Q2",
    "drive_stats_Q3",
    "drive_stats_Q4",
]

data: list[DataFrame] = [
    spark.read.csv(file, header=True, inferSchema=True) for file in DATA_FILES
]

Previously, we checked the length of the columns and added the extra columns of Q4 to the others. 
This is an option, but inspecting our CSV files shows us that added information is really of little interest to our investigation, so let us just keep the common columns.

In [None]:
common_columns: set[str] = reduce(
    lambda x, y: x.intersection(y), [set(df.columns) for df in data]
)
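# fail fast: stop here if a column we rely on is missing
required_columns: set[str] = {"model", "failure", "date", "capacity_bytes"}
missing_columns: set[str] = required_columns - common_columns
if missing_columns:
    raise ValueError(f"Required columns missing: {sorted(missing_columns)}")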

full_data: DataFrame = reduce(
    lambda x, y: x.select(list(common_columns)).union(y.select(list(common_columns))),
    data,
)

#### Analysing the code
I packed the cell with three actions. The first action is to create a set of column names that the four quarters have in common. I used a $\lambda$ function, a reduce function, a list comprehension, and a few Python functions to achieve the result. 

- First I created a set of columns for each DataFrame in data: `[set(df.columns) for df in data]`. It is important that I convert the list `df.columns` to a set, because I want to use the `intersection` method of the `set` class.
- Then I feed two sets of columns as arguments to the lambda function: `x.intersection(y)`. Basic set theory teaches us that this returns only those columns that both sets have in common. A lambda, or $\lambda$ function, is an anonymous function; it has no name and is not stored under a name of its own. It lives as long as the encompassing function call lives.
- The reduce function is a programming construct that walks through an iterable and applies a two-argument function to the running result and the next element, until a single result is left. This type of function is more commonly known as a fold. A fold can combine the elements in two ways:
  
1. From right to left: a right fold of `x+y` over `[1, 2, 3, 4, 5]` calculates `(1+(2+(3+(4+5))))`. The innermost `(4+5)` is evaluated first. This is the way Haskell's `foldr` groups the operations. This type of evaluation leaves a lot of pending function calls on the stack.
2. From left to right, which is what Python does: `reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])` calculates `((((1+2)+3)+4)+5)`. The innermost addition `(1+2)` is evaluated first. This keeps the stack small, but each intermediate result has to be kept until the next step uses it.

In Python you should only use reduce with an operation that has the [associative property](https://en.wikipedia.org/wiki/Associative_property), because only then does the grouping not matter. Subtraction, for instance, is not associative: $(1-(2-3))\ne((1-2)-3)$. PySpark's reduce is only defined for associative operations. In Scala or Haskell you would use a left or right fold, as the operation dictates.
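The difference between the two fold directions can be sketched in plain Python. `fold_right` is a hypothetical helper written for this illustration; it is not part of `functools`.

```python
from functools import reduce
from typing import Callable, Sequence


def fold_right(func: Callable[[int, int], int], items: Sequence[int]) -> int:
    """Hypothetical helper: combine the items from right to left."""
    return reduce(lambda acc, x: func(x, acc), reversed(items))


numbers = [1, 2, 3]

# Subtraction is not associative, so the fold direction changes the answer.
left_sub = reduce(lambda x, y: x - y, numbers)       # ((1 - 2) - 3)
right_sub = fold_right(lambda x, y: x - y, numbers)  # (1 - (2 - 3))
print(left_sub, right_sub)  # -4 2

# Addition is associative, so both directions agree.
left_add = reduce(lambda x, y: x + y, numbers)
right_add = fold_right(lambda x, y: x + y, numbers)
print(left_add, right_add)  # 6 6
```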

The second action is a fail fast. Before I do the computationally expensive third step, I make sure that the result is as I expected. If the third action failed, I would have to debug the code, and I might have to trace a long log to find the fault. By checking early, a practice called localizing failures, I avoid having to do that. In general you want to fail fast with:
1. Network calls
2. Database start up
3. Securing RESTful APIs with request validation, usually a token of sorts.

More information on fail fast can be found in this article by [Martin Fowler](https://martinfowler.com/ieeeSoftware/failFast.pdf)

The third action does the heavy lifting by performing a union operation on two selects. Again, I use the `reduce` and the $\lambda$ to keep the code terse.

This kind of code, where functional programming concepts such as `comprehensions`, `reduce`, and `lambda` do the heavy lifting, is very common in data science and engineering. You need to be able to read this kind of code. I also use this style of coding when talking about RDDs and UDFs in my notebook about that topic.

In [None]:
full_data.select(["model", "failure", "date", "capacity_bytes"]).where(
    "failure == 1"
).show(n=5)

#### SQL expressions in PySpark
The code above and its analysis were demanding enough, so on a lighter note: there are three PySpark methods that accept SQL expressions:
1. `selectExpr`: similar to `select`, but it only accepts SQL expressions.
2. `expr()`: allows you to use a SQL expression within a standard `select`.
3. `filter`/`where`: accept SQL expressions as well if given as a string.

You can use these methods to blend Python code with SQL.

In [None]:
df = spark.createDataFrame([["George"], ["Rhino"]], ["name"])
df.select("name", F.expr("length(name)")).show()

Consider the following select statement:

In [None]:
full_data_select = full_data.withColumn(
    "capacity_in_GB", F.round(F.col("capacity_bytes") / pow(1024, 3), 0)
).select("model", "failure", "date", "capacity_in_GB")

full_data_select.show(n=5)

Using `selectExpr`, we can write this more tersely:

In [None]:
full_data_gb: DataFrame = full_data.selectExpr(
    "model",
    "failure",
    "date",
    "round(capacity_bytes / pow(1024,3),0) as capacity_in_GB",
)

full_data_gb.show(n=5)

In [None]:
drive_days = full_data_gb.groupBy("model", "capacity_in_GB").agg(
    F.count("*").alias("drive_days")
)

failures = (
    full_data_gb.filter("failure == 1")
    .groupBy("model", "capacity_in_GB")
    .agg(F.count("*").alias("failures"))
)

summarized_data = (
    drive_days.join(other=failures, on=["model", "capacity_in_GB"], how="left")
    .fillna(value=0.0, subset=["failures"])
    .selectExpr("model", "capacity_in_GB", "failures / drive_days as failure_rate")
    .cache()
)

summarized_data.show(n=10, truncate=False)

What can you do with all this information? You can write a simple function to present the most
reliable hard disk drive for its capacity.

In [None]:
def reliable_drive_for_capacity(
    data: DataFrame, capacity_in_GB: int = 2048, precision: float = 0.25, top_n: int = 3
) -> DataFrame:
    """
    desc: Function that returns the top N drives with lowest failure rate for the given capacity
    test: No tests as I need to mock a DataFrame.
    """
    capacity_min = capacity_in_GB / (1 + precision)
    capacity_max = capacity_in_GB * (1 + precision)
    return (
        data.filter(f"capacity_in_gb between {capacity_min} and {capacity_max}")
        .orderBy("failure_rate", "capacity_in_GB", ascending=[True, False])
        .limit(top_n)
    )

In [None]:
reliable_drive_for_capacity(data=summarized_data, capacity_in_GB=11176).show()

In [None]:
reliable_drive_for_capacity(data=summarized_data, capacity_in_GB=6500).show()

You can perform this operation without a join. This solution requires quite some insight into PySpark; I would not have come up with it myself, and I wonder if it is the clearest solution for the reader. It relies on a trick: the `failure` column only holds 0 or 1, so summing the column counts the failures (the zeros add nothing), while `count` gives the drive days in the same pass. The sum thus acts as both a filter and a count. The reason to write the code without a join is that it is faster.
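The trick is easy to check in plain Python. The flags below are made up for the illustration and do not come from the Backblaze files.

```python
# One made-up flag per drive day: 1 means the drive failed that day.
failure_flags = [0, 0, 1, 0, 1, 0]

failures = sum(failure_flags)    # the zeros add nothing, so this counts the ones
drive_days = len(failure_flags)  # every record is one drive day
failure_rate = failures / drive_days

# Same answer as filtering first and counting afterwards.
filtered_count = len([flag for flag in failure_flags if flag == 1])
print(failures, filtered_count, drive_days)  # 2 2 6
```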

In [None]:
joinless: DataFrame = (
    full_data_gb.groupBy("model", "capacity_in_GB")
    .agg(F.sum("failure").alias("failures"), F.count("*").alias("drive_days"))
    .selectExpr("model", "capacity_in_GB", "failures / drive_days as failure_rate")
    .cache()
)
joinless.show(n=10, truncate=False)

What does the `cache` method do? If you look up the PySpark documentation on [cache](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cache.html?highlight=cache#pyspark.sql.DataFrame.cache), you get a rather uninformative answer: "Persists the DataFrame with the default storage level." A more helpful description would be: `cache` stores intermediate DataFrames for later computational use. Instead of recomputing all results, caching stores some results in memory so that later calculations can reuse these partial results.

Why is caching so important? Three reasons:

1. Cost-efficiency. Distributed computations, like PySpark's, are expensive. Reusing computations saves money, especially in environments like Azure or GCP where you may have to pay for CPU usage.
2. Time-efficiency. Obviously, you save time by not having to redo computations but by using stored results.
3. Execution-efficiency. Again, quite obviously, not having to redo computations means the execution will be more efficient and you can perform more jobs per cluster.

The easiest way to see the advantages of caching results is with an example. Take a function that tells you the n-th Fibonacci number:

In [None]:
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


%time fib(40)

As you will see if you run the cell below, the time difference is enormous. We will move from a runtime of seconds to one of microseconds. 

In [None]:
# lru = least recently used
@lru_cache
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)


%time fib(40)
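To see where the speed-up comes from, you can count how often the function body runs. The sketch below uses the hypothetical names `fib_plain` and `fib_cached` so it does not overwrite the `fib` defined above, and a smaller argument so the uncached version finishes quickly.

```python
from collections import Counter
from functools import lru_cache

calls: Counter = Counter()  # counts how often the uncached body runs


def fib_plain(n: int) -> int:
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n: int) -> int:
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


plain_result = fib_plain(20)
cached_result = fib_cached(20)

print(plain_result, cached_result)     # 6765 6765
print(calls["plain"])                  # 21891 executions without a cache
print(fib_cached.cache_info().misses)  # 21 executions with a cache, one per argument 0..20
```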

## Window functions
In SQL, a window function, also known as an analytical function, is a function that uses the values of one or more rows to return a value for each row; the rows keep their identity. This is in contrast to an aggregate function, which returns a single value for multiple rows.

From the [PostgreSQL](https://www.postgresql.org/docs/current/tutorial-window.html) documentation comes the following example:

`SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;`

Window functions have an `OVER` clause and a `PARTITION BY` clause. The `PARTITION BY` divides the rows into mutually exclusive sets of rows. In the above example, the sets are made on `depname`. Over those subsets, we apply the aggregate function `avg`. In essence, a window function is no different from a regular SQL function, but instead of applying it to all the rows, we apply it to subsets of the rows. This results in the following combination:

  depname  | empno | salary |          avg
-----------|-------|--------|-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667


A window function follows the same split-apply-combine pattern we have seen before. In PySpark, we need to have a WindowSpec object that specifies how we partition our DataFrame into N mutually exclusive sets of rows.
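The split-apply-combine steps behind the PostgreSQL example can be sketched in plain Python. This only illustrates the idea; it is not how Spark or PostgreSQL implement it.

```python
from collections import defaultdict

# (depname, empno, salary) rows copied from the PostgreSQL example above
empsalary = [
    ("develop", 11, 5200), ("develop", 7, 4200), ("develop", 9, 4500),
    ("develop", 8, 6000), ("develop", 10, 5200),
    ("personnel", 5, 3500), ("personnel", 2, 3900),
    ("sales", 3, 4800), ("sales", 1, 5000), ("sales", 4, 4800),
]

# Split: one mutually exclusive set of salaries per department.
partitions: dict = defaultdict(list)
for depname, _, salary in empsalary:
    partitions[depname].append(salary)

# Apply: the aggregate function runs once per set.
averages = {dep: sum(values) / len(values) for dep, values in partitions.items()}

# Combine: every row keeps its identity and gets its department's average.
with_avg = [(dep, empno, salary, averages[dep]) for dep, empno, salary in empsalary]
for row in with_avg:
    print(row)
```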

In [None]:
path: str = (
    "./ProgrammingProjects/SparkTest/DataAnalysisWithPythonAndPySpark-Data-trunk/"
)

gsod: DataFrame = (
    reduce(
        lambda x, y: x.unionByName(y, allowMissingColumns=True),
        [
            spark.read.parquet(f"{path}gsod_noaa/gsod{year}.parquet")
            for year in range(2010, 2021)
        ],
    )
    .dropna(subset=["year", "mo", "da", "temp"])
    .where(F.col("temp") != 9999.9)
    .drop("date")
)

I am using a Pandas user-defined function (UDF) here to convert the temperatures from Fahrenheit to Celsius, the scale used in Europe, which I understand more intuitively. I discuss user-defined functions in the RDD, UDF, and Pandas notebook.

In [None]:
@F.pandas_udf("double")
def fahrenheit_to_celsius(degrees: pd.Series) -> pd.Series:
    """converts degrees in Fahrenheit to Celsius"""
    return round((degrees - 32) * 5 / 9, 1)

#### summarize-join-approach

What we do with window functions (create mutually exclusive sets of rows, apply a function to those sets, and combine the results back together) we can also do without a window function. See the example below.

In [None]:
coldest_temp: DataFrame = gsod.groupBy("year").agg(
    F.min(fahrenheit_to_celsius("temp")).alias("temp")
)
coldest_temp.orderBy("temp").show()

In [None]:
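# coldest_temp holds Celsius values, so convert gsod's Fahrenheit temp before joining
coldest_when: DataFrame = (
    gsod.withColumn("temp", fahrenheit_to_celsius("temp"))
    .join(other=coldest_temp, on=["year", "temp"], how="left_semi")
    .select("stn", "year", "mo", "da", "temp")
)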
coldest_when.orderBy("year", "mo", "da").show()

#### self-join
This join is a bit odd; we first create a subset of our GSOD data, `coldest_temp`, and then join it with GSOD (which contains the subset) again. This type of self-join is considered an anti-pattern (bad programming practice) in data manipulation. Though technically possible and not wrong per se, it is unwieldy, slow, and does not feel right.

You can solve this better using a window function; this requires less code and is faster. The first step is to get a `WindowSpec` object, with which we can partition our DataFrame into mutually exclusive sets of rows.

In [None]:
each_year: WindowSpec = Window.partitionBy("year")
each_year

#### split-apply-combine revisited
We first need to create a specification for a window function.

1. Split: The `WindowSpec` object tells the window function to split the DataFrame into mutually exclusive sets of rows according to the values in the year column.
2. Apply: We apply the aggregate function `min` [`over`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.over.html?highlight=over#pyspark.sql.Column.over) our column.
3. Combine: The combining step is implicit. You do not need to code that part; if you look at the output Spark gives you, you will see the combined result.

In [None]:
# convert temp to Celsius first so temp and min_temp are on the same scale
wf = (
    gsod.withColumn("temp", fahrenheit_to_celsius("temp"))
    .withColumn("min_temp", F.min("temp").over(each_year))
    .where("temp = min_temp")
    .select("stn", "year", "mo", "da", "temp")
    .orderBy("year", "mo", "da")
)
wf.show()

#### Mutually exclusive sets
A DataFrame has partitions. In PySpark, these partitions are the physical splits of the data across the executor nodes. Partitioning itself means no more than segmenting, in our case, rows into mutually exclusive sets of rows. I do not think the vocabulary should be confusing; after all, DataFrame partitioning means the same thing: creating sets of rows that are mutually exclusive, where the exclusivity lies in the executor node each set runs on. However, sometimes it is clarifying to have more names for the same idea: partition = mutually exclusive set.

We can create our mutually exclusive sets based on more than one attribute.

In [None]:
hottest_day: WindowSpec = Window.partitionBy("year", "mo", "da")

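(
    # convert temp to Celsius first so temp and max_temp are on the same scale
    gsod.withColumn("temp", fahrenheit_to_celsius("temp"))
    .select(
        "stn",
        "year",
        "mo",
        "da",
        "temp",
        F.max("temp").over(hottest_day).alias("max_temp"),
    )
    .filter(F.col("temp") == F.col("max_temp"))
    .drop("temp")
).show()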

## Ranking functions
Ranking functions do exactly what you would expect them to do: rank records based upon some value of a field. There are several distinctive ranking functions:

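1. `rank`: non-consecutive ranks; tied records share a rank and the ranks after a tie are skipped
2. `dense_rank`: consecutive ranks; tied records share a rank and no rank is skipped
3. `percent_rank`: percentile ranks
4. `ntile`: tiles; the ordered records are divided into n groups of roughly equal size
5. `row_number`: the row number of the record within its window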

To be able to rank we need to order by, which we can do in SQL like so:

```
SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
FROM empsalary;
```
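What `rank`, `dense_rank`, and `row_number` would return for the develop department in this query can be sketched in plain Python. The salaries come from the PostgreSQL table shown earlier; the list comprehensions are my own illustration of the definitions, not Spark code.

```python
# Salaries of the develop department, ordered by salary DESC as in the query.
salaries = [6000, 5200, 5200, 4500, 4200]

# row_number: simply the position in the ordered partition.
row_number = list(range(1, len(salaries) + 1))

# rank: 1 + the number of rows strictly ahead; ties leave gaps.
rank = [1 + sum(other > s for other in salaries) for s in salaries]

# dense_rank: 1 + the number of distinct values strictly ahead; no gaps.
dense_rank = [1 + len({other for other in salaries if other > s}) for s in salaries]

print(row_number)  # [1, 2, 3, 4, 5]
print(rank)        # [1, 2, 2, 4, 5]
print(dense_rank)  # [1, 2, 2, 3, 4]
```
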
In PySpark we need the WindowSpec object to do the splitting; the same object can also do the ordering.

In [None]:
temp_per_month_asc: WindowSpec = Window.partitionBy("mo").orderBy("count_temp")

To keep things running on a single PC the book provides a smaller GSOD dataset.

In [None]:
path: str = (
    "./ProgrammingProjects/SparkTest/DataAnalysisWithPythonAndPySpark-Data-trunk/window/"
)

gsod_light: DataFrame = spark.read.parquet(path + "gsod_light.parquet")
gsod_light.show()

In [None]:
gsod_light.withColumn("rank_tmp", F.rank().over(temp_per_month_asc)).show()

In [None]:
temp_each_year: WindowSpec = each_year.orderBy("temp")
gsod_light.withColumn("rank_tmp", F.percent_rank().over(temp_each_year)).show()

The above result needs some explanation. Take the second-to-last record. Its partition (2019) holds a total of four records, two of which have a value lower than 44.7. The formula for calculating the percentile rank is:

$\frac{\text{records with a lower value than the current one}}{\text{number of records in the window - 1}}$

or 

$\frac{2}{4-1} = 0.667$

Percentile rank is useful if you want to reflect where a value stands in comparison to its peers in the window.
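The formula can be checked with a few lines of plain Python. Only the value 44.7 and the partition size of four come from the text above; the other three temperatures are made up for the illustration.

```python
def percent_rank(values: list) -> list:
    """Share of the other records in the window with a lower value."""
    n = len(values)
    if n == 1:
        return [0.0]  # a single-record window has no peers to compare with
    return [sum(other < v for other in values) / (n - 1) for v in values]


# Hypothetical 2019 partition: two records are lower than 44.7.
temps_2019 = [37.2, 41.0, 44.7, 50.3]
ranks = percent_rank(temps_2019)
print([round(r, 3) for r in ranks])  # [0.0, 0.333, 0.667, 1.0]
```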

We can use `row_number` if we want to know where a record sits in the window.

In [None]:
gsod_light.withColumn("rank_tmp", F.row_number().over(temp_each_year)).show()

There is no ascending parameter in the orderBy method on a Window, unlike the orderBy method of a 
DataFrame. See the [documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.orderBy.html#pyspark.sql.Window.orderBy). If you want a reversed order you will have to include that in your specification. 

In [None]:
temp_per_month_desc: WindowSpec = Window.partitionBy("mo").orderBy(
    F.col("count_temp").desc()
)

gsod_light.withColumn("rank_tmp", F.row_number().over(temp_per_month_desc)).show()

## Analytic functions
Of course, there are more analytical functions than the ranking functions. Analytic functions calculate an aggregate value based on a set of rows. Unlike aggregate functions, analytic functions can return multiple rows for each window. We use analytic functions to compute moving averages, running totals, percentages or top-N results within a group.

Examples of analytical functions are:

- `lag`: Accesses data from a previous row in the same result set
- `lead`: Accesses data from a subsequent row in the same result set 
- `cume_dist`: Calculates the relative position of a specified value in a group of values. Assuming ascending ordering, the cumulative distribution of a value in row r is defined as: $\frac{\text{the number of rows with values less than or equal to that value in row r}}{\text{the number of rows evaluated in the partition}}$

As we can see from the formula, the cumulative distribution is closely related to the percentile rank.
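A small plain-Python sketch shows how close the two are, and where they differ. The values are made up, and they contain a tie on purpose.

```python
values = [10, 20, 20, 30]  # made-up partition, ordered ascending, with a tie
n = len(values)

# cume_dist: rows with a value <= the current value, divided by all rows.
cume_dist = [sum(other <= v for other in values) / n for v in values]

# percent_rank: rows with a value < the current value, divided by (rows - 1).
percent_rank = [sum(other < v for other in values) / (n - 1) for v in values]

print(cume_dist)  # [0.25, 0.75, 0.75, 1.0]
print([round(p, 3) for p in percent_rank])  # [0.0, 0.333, 0.333, 1.0]
```

The tied rows are included in the numerator of `cume_dist` but not in that of `percent_rank`, which is why `cume_dist` never returns 0.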


In [None]:
gsod_light.withColumn("previous_temp", F.lag("temp").over(temp_each_year)).show()

In [None]:
gsod_light.withColumn("rank_tmp", F.percent_rank().over(temp_each_year)).withColumn(
    "cumulative_distance", F.cume_dist().over(temp_each_year)
).show()

#### Window Frames
Consider the code below. We apply the average function twice, once over an unordered window and once over an ordered window, and we get different results. That is unexpected, as ordering should not make a difference to `avg`. Yet, looking at the result columns, we see there is a problem with the ordered window.

In [None]:
not_ordered: WindowSpec = Window.partitionBy("year")
ordered: WindowSpec = not_ordered.orderBy("temp")

gsod_light.withColumn("avg_NO", F.avg("temp").over(not_ordered)).withColumn(
    "avg_O", F.avg("temp").over(ordered)
).show(n=5)

#### Two main types of window frames
When you look at the documentation for [Window](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.html?highlight=window#pyspark.sql.Window) you find that the class creates the window with one of two default frames:

1. An unbounded window frame if ordering is not applied: `rowFrame`, from `unboundedPreceding` to `unboundedFollowing`.
2. A growing window frame if ordering is applied: `rangeFrame`, from `unboundedPreceding` to `currentRow`.

Definition: a window frame is the set of boundaries of a window.

| year | mo | da | temp | boundary                  |
|------|----|----|------|---------------------------|
| 2018 | 02 | 21 | 16.1 | Window.unboundedPreceding |
| 2019 | 04 | 19 | 20.6 | ...                       |
| 2019 | 06 | 12 | 23.4 | -1 (row before current)   |
| 2019 | 06 | 20 | 27.8 | Window.currentRow         |
| 2019 | 06 | 30 | 25.8 | +1 (row after current)    |
| 2019 | 07 | 19 | 24.7 | ...                       |
| 2019 | 08 | 17 | 20.8 | Window.unboundedFollowing |

In the above window, we can move up to the first row, marked with the `Window.unboundedPreceding` attribute, and down until we come to the `Window.unboundedFollowing` attribute. The current row has the value 0, the row above it -1, and the row below it 1. PySpark assigns suitable values to both the `Window.unboundedPreceding` and `Window.unboundedFollowing` attributes.
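Why the ordered window produced a different average can be imitated in plain Python. The sketch assumes the default frames described in the Window documentation (whole partition without ordering, first row up to the current row with ordering); the three temperatures are made up.

```python
def mean(values: list) -> float:
    return sum(values) / len(values)


partition = [27.0, 20.0, 22.0]  # made-up temperatures within one year

# No ordering: every row sees the whole partition.
avg_unordered = [mean(partition) for _ in partition]

# Ordering: every row sees all values up to and including its own value.
ordered = sorted(partition)
avg_ordered = [mean([t for t in ordered if t <= current]) for current in ordered]

print(avg_unordered)  # [23.0, 23.0, 23.0]
print(avg_ordered)    # [20.0, 21.0, 23.0]
```

Only the last row of the ordered window sees the whole partition, so only there do the two averages agree.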

Knowing this, we can add boundaries to our window specification and solve our conundrum.

In [None]:
not_ordered: WindowSpec = Window.partitionBy("year").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)
ordered: WindowSpec = not_ordered.orderBy("temp").rangeBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)

In [None]:
gsod_light.withColumn("avg_NO", F.avg("temp").over(not_ordered)).withColumn(
    "avg_O", F.avg("temp").over(ordered)
).show(n=5)

It is simple to determine if you need an ordered or unordered window. Aggregation functions like `sum` or `avg` do not care about the order of the rows. Ranking functions and analytical functions such as `lead` or `lag` depend on ordered windows.

In [None]:
gsod_light_p: DataFrame = (
    gsod_light.withColumn("year", F.lit(2019))
    .withColumn(
        "dt", F.to_date(F.concat_ws("-", F.col("year"), F.col("mo"), F.col("da")))
    )
    .withColumn("dt_num", F.unix_timestamp("dt"))
)

In [None]:
gsod_light_p.show(n=5)

To be able to compare the temperature of a given day with the average of the month before and the month after, gsod_light needed to undergo some transformations. A date column `dt` was added. To create a number from that date for easy counting, `dt_num` was added as a Unix timestamp of the date. The Unix timestamp counts the number of seconds elapsed since January 1, 1970. To approximate the month preceding or following, we use 30 days expressed in seconds: `60*60*24*30`.

In [None]:
month_in_sec: int = 60 * 60 * 24 * 30
one_month_before_and_after: WindowSpec = (
    Window.partitionBy("year").orderBy("dt_num").rangeBetween(-month_in_sec, month_in_sec)
)

In [None]:
gsod_light_p.withColumn(
    "avg_count", F.avg("count_temp").over(one_month_before_and_after)
).show()

#### Range vs. rows

For each record in the window, Spark computes the range boundaries based on the current row's value in the field `dt_num` and determines the actual window the function will aggregate over, in effect narrowing or growing the window. If you use ranges, the actual value of a row is used, not the row number. You make the window respect the context you are applying it over.

PySpark has six kinds of window frames:

 - Rows bounded; the window stays the same size and moves from record to record. The window is based on the position of the row. Numerical values are relative to the position of the current row.
 - Range bounded; the window spans the same range of values and moves from record to record. The window is based on the value of each row (e.g., 1555538400 in the above example). Numerical values are relative to the value of the current row (e.g., a bound of -1 reaches rows whose value is up to 1 lower than that of the current row).
 - Rows growing; the window grows and shrinks in the direction in which it is not bounded. The window is based on the position of the row. Numerical values are relative to the position of the current row.
 - Range growing; the window grows and shrinks in the direction in which it is not bounded, for instance when it starts at `Window.unboundedPreceding`. The window is based on the value of each row (e.g., 1555538400 in the above example). Numerical values are relative to the value of the current row.
 - Rows unbounded; the window contains the whole partition. It stays the same for every record in the partition.
 - Range unbounded; the window contains the whole partition. It stays the same for every record in the partition.
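The difference between a rows-bounded and a range-bounded frame can be sketched in plain Python. `rows_between` and `range_between` are hypothetical helpers that only mimic the idea behind the `WindowSpec` methods of the same name, and the data is made up.

```python
# (ordering key, value) pairs, already sorted by key; made-up data
data = [(0, 10.0), (1, 20.0), (2, 30.0), (10, 40.0)]


def rows_between(index: int, start: int, end: int) -> list:
    """Frame by position: offsets count rows around the current row."""
    low = max(0, index + start)
    high = min(len(data) - 1, index + end)
    return [value for _, value in data[low : high + 1]]


def range_between(index: int, start: int, end: int) -> list:
    """Frame by value: offsets are added to the current row's key."""
    key = data[index][0]
    return [value for k, value in data if key + start <= k <= key + end]


# Current row is (2, 30.0): the row with key 10 is a neighbor by position,
# but it lies far outside the value range 1..3.
print(rows_between(2, -1, 1))   # [20.0, 30.0, 40.0]
print(range_between(2, -1, 1))  # [20.0, 30.0]
```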

In [None]:
@F.pandas_udf("double")
def median(vals: pd.Series) -> float:
    return vals.median() 

spec: WindowSpec = Window.partitionBy("year")

gsod_light.withColumn(
    "median_temp", median("temp").over(spec)).withColumn(
        "median_temp_bounded", median(F.col("temp")).over(spec.orderBy("mo", "da")) 
).show()

As you can see, we can do a window with our own defined function, or in this case, the `median` 
function from Pandas. More on this in the next notebook on RDDs and user-defined functions.

#### The main steps to a successful window function are:

1. What kind of operation do I want to perform? Summarise, rank, or look ahead/behind.
2. How do I construct my window? Should it be bounded or unbounded? In other words, do I need every record to have the same window value (unbounded) or should the answer depend on where the record fits in the window (bounded)? When you bound a window frame, you usually order it as well.
3. If you have a bounded window, do you want to bound the window according to the position of the record (row based) or the value of the record (range based)?
4. A window function does not make your DataFrame special; you can still filter it, use a group by, or apply another window function.