# Window Function

PySpark window functions operate on a group of rows (a partition, optionally narrowed to a frame within it) and return a single value for every input row. PySpark SQL supports three kinds of window functions:

- Ranking functions
- Analytic functions
- Aggregate functions

To apply a window function per group, we first partition the data with
`Window.partitionBy()`.

Ranking and analytic functions such as `row_number`, `rank`, `lag` and `lead` additionally require an `orderBy` clause.


In [1]:
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql import Row
from numpy.random import rand
from pyspark.sql.types import IntegerType, StringType

My machine has the following configuration:
- 6 cores (12 vCores)
- 32 GB RAM

Spark Standalone server:
```
cd /opt/softwares/spark-3.0.1-bin-hadoop3.2/

export PYSPARK_PYTHON=/opt/envs/ai4e/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/envs/ai4e/bin/python

# start the master and the workers
sbin/start-all.sh
# stop them again when finished
sbin/stop-all.sh
```
Spark UI: [http://localhost:8080](http://localhost:8080)   
Spark Master URL : spark://IMCHLT276:7077

In [2]:
spark = SparkSession.builder \
    .master("spark://IMCHLT276:7077") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "6") \
    .config("spark.local.dir", "/opt/tmp/spark-temp/") \
    .appName("DataSkewness") \
    .getOrCreate()

In [3]:
data = (
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
)

columns = ["employee_name", "department", "salary"]

In [4]:
df = spark.createDataFrame(data, schema=columns)
df.show()

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+



In [5]:
window_spec = Window.partitionBy('department').orderBy('salary')

## Ranking and analytic functions

**row_number**

In [6]:
from pyspark.sql.functions import row_number
df.withColumn("row_number", row_number().over(window_spec)).show()

+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
|        James|     Sales|  3000|         1|
|        James|     Sales|  3000|         2|
|       Robert|     Sales|  4100|         3|
|         Saif|     Sales|  4100|         4|
|      Michael|     Sales|  4600|         5|
|        Maria|   Finance|  3000|         1|
|        Scott|   Finance|  3300|         2|
|          Jen|   Finance|  3900|         3|
|        Kumar| Marketing|  2000|         1|
|         Jeff| Marketing|  3000|         2|
+-------------+----------+------+----------+



**rank**

In [7]:
from pyspark.sql.functions import rank
df.withColumn("rank", rank().over(window_spec)).show()

+-------------+----------+------+----+
|employee_name|department|salary|rank|
+-------------+----------+------+----+
|        James|     Sales|  3000|   1|
|        James|     Sales|  3000|   1|
|       Robert|     Sales|  4100|   3|
|         Saif|     Sales|  4100|   3|
|      Michael|     Sales|  4600|   5|
|        Maria|   Finance|  3000|   1|
|        Scott|   Finance|  3300|   2|
|          Jen|   Finance|  3900|   3|
|        Kumar| Marketing|  2000|   1|
|         Jeff| Marketing|  3000|   2|
+-------------+----------+------+----+



**dense_rank**

In [8]:
from pyspark.sql.functions import dense_rank
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()

+-------------+----------+------+----------+
|employee_name|department|salary|dense_rank|
+-------------+----------+------+----------+
|        James|     Sales|  3000|         1|
|        James|     Sales|  3000|         1|
|         Saif|     Sales|  4100|         2|
|       Robert|     Sales|  4100|         2|
|      Michael|     Sales|  4600|         3|
|        Maria|   Finance|  3000|         1|
|        Scott|   Finance|  3300|         2|
|          Jen|   Finance|  3900|         3|
|        Kumar| Marketing|  2000|         1|
|         Jeff| Marketing|  3000|         2|
+-------------+----------+------+----------+
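The three ranking functions differ only in how they treat ties. The following is a plain-Python sketch of those semantics, not how Spark computes them; `ranking_columns` is a hypothetical helper, and the input is the Sales partition from the tables above, already ordered by salary.

```python
def ranking_columns(sorted_values):
    """Return (row_number, rank, dense_rank) lists for one sorted partition."""
    row_numbers, ranks, dense_ranks = [], [], []
    for i, value in enumerate(sorted_values):
        # row_number always increments, ties or not
        row_numbers.append(i + 1)
        if i > 0 and value == sorted_values[i - 1]:
            # tie: rank and dense_rank repeat the previous value
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            # rank jumps to the row position, leaving gaps after ties
            ranks.append(i + 1)
            # dense_rank moves up by exactly one, leaving no gaps
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return row_numbers, ranks, dense_ranks


sales = [3000, 3000, 4100, 4100, 4600]  # Sales salaries, ordered by salary
row_numbers, ranks, dense_ranks = ranking_columns(sales)
print(row_numbers)  # [1, 2, 3, 4, 5]
print(ranks)        # [1, 1, 3, 3, 5]
print(dense_ranks)  # [1, 1, 2, 2, 3]
```

Because `row_number` has to break ties somehow, the order of the two James rows (and of Saif and Robert) is arbitrary, which is why they can swap places between the outputs above.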



**cume_dist**

In [9]:
from pyspark.sql.functions import cume_dist
df.withColumn("cume_dist", cume_dist().over(window_spec)) \
  .show()

+-------------+----------+------+------------------+
|employee_name|department|salary|         cume_dist|
+-------------+----------+------+------------------+
|        James|     Sales|  3000|               0.4|
|        James|     Sales|  3000|               0.4|
|       Robert|     Sales|  4100|               0.8|
|         Saif|     Sales|  4100|               0.8|
|      Michael|     Sales|  4600|               1.0|
|        Maria|   Finance|  3000|0.3333333333333333|
|        Scott|   Finance|  3300|0.6666666666666666|
|          Jen|   Finance|  3900|               1.0|
|        Kumar| Marketing|  2000|               0.5|
|         Jeff| Marketing|  3000|               1.0|
+-------------+----------+------+------------------+
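The values in the `cume_dist` column can be read as "the fraction of rows in the partition whose salary is less than or equal to this row's salary". A minimal plain-Python sketch of that definition (the helper name `cume_dist_column` is an assumption made for illustration, not part of PySpark):

```python
def cume_dist_column(sorted_values):
    """Cumulative distribution of each value within one partition."""
    n = len(sorted_values)
    # count the rows at or below the current value, then divide by partition size
    return [
        sum(1 for other in sorted_values if other <= value) / n
        for value in sorted_values
    ]


sales = [3000, 3000, 4100, 4100, 4600]  # Sales salaries, ordered by salary
print(cume_dist_column(sales))  # [0.4, 0.4, 0.8, 0.8, 1.0]
```

Tied rows share the same value because each of them counts all of its peers.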



**lag**

In [10]:
from pyspark.sql.functions import lag
df.withColumn("lag", lag("salary", 2).over(window_spec)) \
  .show()

+-------------+----------+------+----+
|employee_name|department|salary| lag|
+-------------+----------+------+----+
|        James|     Sales|  3000|null|
|        James|     Sales|  3000|null|
|         Saif|     Sales|  4100|3000|
|       Robert|     Sales|  4100|3000|
|      Michael|     Sales|  4600|4100|
|        Maria|   Finance|  3000|null|
|        Scott|   Finance|  3300|null|
|          Jen|   Finance|  3900|3000|
|        Kumar| Marketing|  2000|null|
|         Jeff| Marketing|  3000|null|
+-------------+----------+------+----+



**lead**

In [11]:
# lead: value of the row 2 positions after the current one
from pyspark.sql.functions import lead
df.withColumn("lead", lead("salary", 2).over(window_spec)) \
  .show()

+-------------+----------+------+----+
|employee_name|department|salary|lead|
+-------------+----------+------+----+
|        James|     Sales|  3000|4100|
|        James|     Sales|  3000|4100|
|       Robert|     Sales|  4100|4600|
|         Saif|     Sales|  4100|null|
|      Michael|     Sales|  4600|null|
|        Maria|   Finance|  3000|3900|
|        Scott|   Finance|  3300|null|
|          Jen|   Finance|  3900|null|
|        Kumar| Marketing|  2000|null|
|         Jeff| Marketing|  3000|null|
+-------------+----------+------+----+
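Both `lag` and `lead` look up a row at a fixed offset from the current one inside the ordered partition, and return null when that offset falls outside it. As a rough plain-Python sketch of this behaviour (the helpers `lag_column` and `lead_column` are hypothetical, and `None` stands in for Spark's null):

```python
def lag_column(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` if there is none."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead_column(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` if there is none."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


sales = [3000, 3000, 4100, 4100, 4600]  # Sales salaries, ordered by salary
print(lag_column(sales, 2))   # [None, None, 3000, 3000, 4100]
print(lead_column(sales, 2))  # [4100, 4100, 4600, None, None]
```

With an offset of 2, the first two rows of each partition have no `lag` value and the last two have no `lead` value, which matches the null entries in the two tables above.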



## Aggregate Functions

Aggregate functions do not need an `orderBy` clause; partitioning alone is enough.
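The ordering matters for aggregates because it changes the default frame. Without ordering, Spark aggregates over the whole partition; with ordering, the default frame grows from the start of the partition up to the current row, including rows tied with it, so the aggregate becomes a running one. The plain-Python sketch below only illustrates the two results for `sum`; both helper names are hypothetical.

```python
def whole_partition_sum(values):
    """Sum over the entire partition, repeated on every row (no orderBy)."""
    total = sum(values)
    return [total] * len(values)


def running_range_sum(sorted_values):
    """Running sum up to the current row, counting tied rows as well.

    This mimics the default frame of an ordered window.
    """
    return [
        sum(other for other in sorted_values if other <= value)
        for value in sorted_values
    ]


sales = [3000, 3000, 4100, 4100, 4600]  # Sales salaries, ordered by salary
print(whole_partition_sum(sales))  # [18800, 18800, 18800, 18800, 18800]
print(running_range_sum(sales))    # [6000, 6000, 14200, 14200, 18800]
```

Using a window without `orderBy`, as in the cells that follow, gives the per-department totals on every row.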

In [12]:
window_spec_agg  = Window.partitionBy("department")

In [13]:
df.withColumn("row",row_number().over(window_spec)) \
  .withColumn("avg", F.avg(F.col("salary")).over(window_spec_agg)) \
  .withColumn("sum", F.sum(F.col("salary")).over(window_spec_agg)) \
  .withColumn("min", F.min(F.col("salary")).over(window_spec_agg)) \
  .withColumn("max", F.max(F.col("salary")).over(window_spec_agg)) \
  .show()

+-------------+----------+------+---+------+-----+----+----+
|employee_name|department|salary|row|   avg|  sum| min| max|
+-------------+----------+------+---+------+-----+----+----+
|        James|     Sales|  3000|  1|3760.0|18800|3000|4600|
|        James|     Sales|  3000|  2|3760.0|18800|3000|4600|
|         Saif|     Sales|  4100|  3|3760.0|18800|3000|4600|
|       Robert|     Sales|  4100|  4|3760.0|18800|3000|4600|
|      Michael|     Sales|  4600|  5|3760.0|18800|3000|4600|
|        Maria|   Finance|  3000|  1|3400.0|10200|3000|3900|
|        Scott|   Finance|  3300|  2|3400.0|10200|3000|3900|
|          Jen|   Finance|  3900|  3|3400.0|10200|3000|3900|
|        Kumar| Marketing|  2000|  1|2500.0| 5000|2000|3000|
|         Jeff| Marketing|  3000|  2|2500.0| 5000|2000|3000|
+-------------+----------+------+---+------+-----+----+----+



In [14]:
df.withColumn("row",row_number().over(window_spec)) \
  .withColumn("avg", F.avg(F.col("salary")).over(window_spec_agg)) \
  .withColumn("sum", F.sum(F.col("salary")).over(window_spec_agg)) \
  .withColumn("min", F.min(F.col("salary")).over(window_spec_agg)) \
  .withColumn("max", F.max(F.col("salary")).over(window_spec_agg)) \
  .where(F.col("row")==1).select(["department","avg","sum","min","max"]) \
  .show()

+----------+------+-----+----+----+
|department|   avg|  sum| min| max|
+----------+------+-----+----+----+
|     Sales|3760.0|18800|3000|4600|
|   Finance|3400.0|10200|3000|3900|
| Marketing|2500.0| 5000|2000|3000|
+----------+------+-----+----+----+

