## Overview of Windowing Functions

Let us get an overview of Windowing Functions.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Windowing Functions'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

 * First, let us understand the relevance of these functions using the `employees` data set.

In [None]:
employeesPath = '/public/hr_db/employees'

In [None]:
employees = spark. \
    read. \
    format('csv'). \
    option('sep', '\t'). \
    schema('''employee_id INT, 
              first_name STRING, 
              last_name STRING, 
              email STRING,
              phone_number STRING, 
              hire_date STRING, 
              job_id STRING, 
              salary FLOAT,
              commission_pct STRING,
              manager_id STRING, 
              department_id STRING
            '''). \
    load(employeesPath)

In [None]:
from pyspark.sql.functions import col
employees. \
    select('employee_id', 
           col('department_id').cast('int').alias('department_id'), 
           'salary'
          ). \
    orderBy('department_id', 'salary'). \
    show()

* Let us say we want to compare each individual salary with the department-wise salary expense.
* Here is one approach, which requires a self join.
  * Compute the department-wise expense using `groupBy` and `agg`.
  * Join the result with **employees** again on `department_id`.

In [None]:
from pyspark.sql.functions import sum, col

In [None]:
department_expense = employees. \
    groupBy('department_id'). \
    agg(sum('salary').alias('expense'))

In [None]:
department_expense.show()

In [None]:
employees. \
    select('employee_id', 'department_id', 'salary'). \
    join(department_expense, employees.department_id == department_expense.department_id). \
    orderBy(employees.department_id, col('salary')). \
    show()

 **However, this approach needs a separate aggregation followed by a join, which makes it more complicated than necessary and typically less efficient. Windowing functions simplify the logic and usually run more efficiently.**
 
Now let us get into the details related to Windowing functions.
 * The main module is `pyspark.sql.window`.
 * It has classes such as `Window` and `WindowSpec`.
 * `Window` has APIs such as `partitionBy`, `orderBy`, etc.
 * These APIs (such as `partitionBy`) return a `WindowSpec` object. We can pass the `WindowSpec` object to `over` on functions such as `rank()`, `dense_rank()`, `sum()`, etc.
 * Syntax: `sum('ColumnToAggregate').over(spec)` where `spec = Window.partitionBy('ColumnName')`

| Category | Functions |
| ------------- |:-------------:|
| Aggregate Functions      | <ul><li>sum</li><li>avg</li><li>min</li><li>max</li></ul> |
| Ranking Functions      | <ul><li>rank</li><li>dense_rank</li><li>percent_rank</li><li>row_number</li><li>ntile</li></ul> |
| Analytic Functions      | <ul><li>cume_dist</li><li>first</li><li>last</li><li>lead</li><li>lag</li></ul> |