## Overview of Windowing Functions

Let us get an overview of Windowing Functions.

 * First, let us understand the relevance of these functions using the `employees` data set.

Let us start the Spark session for this notebook so that we can execute the code provided.

If you want to practice from the terminal instead, here is the command to use.

```
spark2-shell \
  --master yarn \
  --name "Windowing Functions" \
  --conf spark.ui.port=0
```

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Windowing Functions").
    master("yarn").
    getOrCreate()

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

In [None]:
import spark.implicits._

In [None]:
val employeesPath = "/public/hr_db/employees"

In [None]:
val employees = spark.
    read.
    format("csv").
    option("sep", "\t").
    schema("""employee_id INT, 
              first_name STRING, 
              last_name STRING, 
              email STRING,
              phone_number STRING, 
              hire_date STRING, 
              job_id STRING, 
              salary FLOAT,
              commission_pct STRING,
              manager_id STRING, 
              department_id STRING
            """).
    load(employeesPath)

In [None]:
import org.apache.spark.sql.functions.col

In [None]:
employees.
    select($"employee_id", 
           $"department_id".cast("int").alias("department_id"), 
           $"salary"
          ).
    orderBy("department_id", "salary").
    show

* Let us say we want to compare each employee's salary with the department-wise salary expense.
* Here is one approach, which requires a self join.
  * Compute the department-wise expense using `groupBy` and `agg`.
  * Join the result with **employees** again on `department_id`.

In [None]:
import org.apache.spark.sql.functions.{sum, col}

In [None]:
val department_expense = employees.
    groupBy("department_id").
    agg(sum("salary").alias("expense"))

In [None]:
department_expense.show

In [None]:
employees.
    select("employee_id", "department_id", "salary").
    // Joining on the column name keeps a single department_id column in the result.
    join(department_expense, "department_id").
    orderBy("department_id", "salary").
    show

 **However, this approach is overly complicated and not very efficient, because it needs a separate aggregation followed by a join. Windowing functions simplify the logic and typically run more efficiently.**
 
Now let us get into the details of Windowing functions.
 * The main package is `org.apache.spark.sql.expressions`.
 * It has classes such as `Window` and `WindowSpec`.
 * `Window` has APIs such as `partitionBy`, `orderBy`, etc.
 * These APIs (such as `partitionBy`) return a `WindowSpec` object. We can pass the `WindowSpec` object to `over` on functions such as `rank()`, `dense_rank()`, `sum()`, etc.
 * Syntax: `sum("salary").over(spec)` where `spec = Window.partitionBy("department_id")`
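
As a minimal sketch of the syntax above, the following computes each department's total salary next to every employee row without a join. The `sampleEmployees` rows, the `department_expense` and `salary_pct` column names, and the `local[*]` master are illustrative assumptions and are not part of the `employees` data set. In the notebook, `getOrCreate()` should simply reuse the session that is already running.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, round, sum}

// Assumption: local master for a standalone run; an existing session is reused as is.
val spark = SparkSession.
    builder.
    master("local[*]").
    appName("Windowing Functions").
    getOrCreate()

import spark.implicits._

// Hypothetical sample rows: (employee_id, department_id, salary)
val sampleEmployees = Seq(
    (1, 10, 1000.0),
    (2, 10, 3000.0),
    (3, 20, 2000.0),
    (4, 20, 2000.0),
    (5, 20, 6000.0)
).toDF("employee_id", "department_id", "salary")

// One window per department; no ordering is needed for a plain aggregate.
val spec = Window.partitionBy("department_id")

val withExpense = sampleEmployees.
    withColumn("department_expense", sum("salary").over(spec)).
    withColumn("salary_pct", round(col("salary") / col("department_expense") * 100, 2))

withExpense.
    orderBy("department_id", "salary").
    show
```

Every input row is preserved here, which is the main difference from `groupBy`: the aggregate is attached to each row instead of collapsing the group.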

| Category        | Functions      |
| ------------- |:-------------:|
| Aggregate Functions      | <ul><li>sum</li><li>avg</li><li>min</li><li>max</li></ul> |
| Ranking Functions      | <ul><li>rank</li><li>dense_rank</li><li>percent_rank</li><li>row_number</li><li>ntile</li></ul> |
| Analytic Functions      | <ul><li>cume_dist</li><li>first</li><li>last</li><li>lead</li><li>lag</li></ul> |
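
To illustrate a few of the ranking and analytic functions from the table, here is a hedged sketch that ranks salaries within each department and looks up the next lower salary. The `sampleSalaries` rows, the output column names (`rnk`, `drnk`, `rn`, `next_salary`), and the `local[*]` master are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, lead, rank, row_number}

// Assumption: local master for a standalone run; an existing session is reused as is.
val spark = SparkSession.
    builder.
    master("local[*]").
    appName("Windowing Functions").
    getOrCreate()

import spark.implicits._

// Hypothetical sample rows: (employee_id, department_id, salary)
val sampleSalaries = Seq(
    (1, 10, 1000.0),
    (2, 10, 3000.0),
    (3, 20, 2000.0),
    (4, 20, 2000.0),
    (5, 20, 6000.0)
).toDF("employee_id", "department_id", "salary")

// Rank within each department, highest salary first.
val rankSpec = Window.
    partitionBy("department_id").
    orderBy(col("salary").desc)

val ranked = sampleSalaries.
    withColumn("rnk", rank().over(rankSpec)).
    withColumn("drnk", dense_rank().over(rankSpec)).
    withColumn("rn", row_number().over(rankSpec)).
    // Salary of the next row in the window ordering; null for the last row.
    withColumn("next_salary", lead("salary", 1).over(rankSpec))

ranked.
    orderBy("department_id", "rn").
    show
```

Unlike the plain aggregates, the ranking functions and `lead`/`lag` depend on row order, so the window specification here includes `orderBy`. Tied salaries share the same `rnk` and `drnk` values, while `rn` stays unique within the department.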