In [None]:
PySpark window functions calculate results, such as rank and row number, over a range of input rows.
They operate on a group of rows (a frame or partition) and return a single value for every input row.
PySpark SQL supports three kinds of window functions:
1. ranking functions
2. analytic functions
3. aggregate functions

In [None]:
To operate on a group, we first partition the data using Window.partitionBy(). For row_number() and rank(),
we additionally need to order the rows within each partition using the orderBy() clause.
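
As a rough illustration of what partitioning and ordering mean, here is a plain-Python sketch (not Spark code). The helper name
partition_and_order and the four sample rows are assumptions made for this example. It groups rows by department and sorts each
group by salary, which is the arrangement a window spec partitioned by "department" and ordered by "salary" describes.

```python
from collections import defaultdict

# A few rows shaped like (employee_name, department, salary).
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Maria", "Finance", 3000),
    ("Scott", "Finance", 3300),
]

def partition_and_order(rows, part_idx, order_idx):
    """Group rows by one field, then sort each group by another field."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[part_idx]].append(row)
    return {key: sorted(group, key=lambda r: r[order_idx])
            for key, group in partitions.items()}

# Partition by department (index 1), order by salary (index 2).
windows = partition_and_order(rows, part_idx=1, order_idx=2)
for department, group in windows.items():
    print(department, group)
```

Each window function is then evaluated independently inside every one of these ordered groups.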

Window Ranking functions, like row_number(), rank(), and dense_rank(), assign sequential numbers to DataFrame rows based on
specified criteria within defined partitions. These functions enable sorting and ranking operations, identifying row positions 
in partitions based on specific orderings.

row_number() assigns unique sequential numbers, rank() ranks with gaps after ties, and dense_rank() ranks
without gaps. They’re useful for selecting the top or bottom elements within groups, analyzing data distributions,
and identifying the highest or lowest values within partitions in PySpark DataFrames.

The row_number() window function assigns a sequential row number, starting from 1, to each row within a window partition.

The rank() window function assigns a rank to each row within a window partition. It leaves gaps in the ranking when
there are ties.

The dense_rank() window function assigns a rank to each row within a window partition without any gaps. It is
similar to rank(); the difference is that rank() leaves gaps in the ranking when there are ties.
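
To make the difference between the three functions concrete, here is a plain-Python sketch of their semantics. It is not Spark's
implementation, and the helper names row_numbers, ranks and dense_ranks are hypothetical. The input is the list of Sales salaries
from the sample data, already sorted in ascending order.

```python
def row_numbers(values):
    """Sequential numbers 1..n; ties still get distinct numbers."""
    return list(range(1, len(values) + 1))

def ranks(values):
    """Rank with gaps: tied values share a rank, the next rank skips ahead."""
    result = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            result.append(result[-1])   # tie: reuse the previous rank
        else:
            result.append(i + 1)        # new value: rank equals position
    return result

def dense_ranks(values):
    """Rank without gaps: each new distinct value gets the next integer."""
    result = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            result.append(result[-1])   # tie: reuse the previous rank
        else:
            result.append(result[-1] + 1 if result else 1)
    return result

sales = [3000, 3000, 4100, 4100, 4600]  # must already be sorted
print(row_numbers(sales))  # [1, 2, 3, 4, 5]
print(ranks(sales))        # [1, 1, 3, 3, 5]
print(dense_ranks(sales))  # [1, 1, 2, 2, 3]
```

Note that when values tie, which tied row receives the lower row number depends on the order in which the rows arrive.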

In [None]:
The ntile(n) window function divides the ordered rows of a window partition into n buckets and returns the bucket number
(from 1 to n) of each row. The example below passes 2 as the argument to ntile, so every row is assigned to bucket 1 or 2.
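
The bucket assignment can be sketched in plain Python as follows. This is an illustration, not Spark's code; the helper name
ntile_buckets is hypothetical, and the sketch assumes the usual SQL convention that when the rows do not divide evenly, the
earlier buckets receive the extra rows.

```python
def ntile_buckets(num_rows, n):
    """Return the bucket number (1..n) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets hold one more row than the rest.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets

# Five Sales rows split into two buckets.
print(ntile_buckets(5, 2))  # [1, 1, 1, 2, 2]
# Three Finance rows split into two buckets.
print(ntile_buckets(3, 2))  # [1, 1, 2]
```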

Window Analytic Functions

cume_dist(): This function computes the cumulative distribution of a value within a window partition, that is, the fraction
of rows in the partition whose ordering value is less than or equal to the current row's value. The result is greater than 0
and at most 1; the highest value in the partition always gets 1. It’s useful for understanding how a value compares with the
others within the same partition. This is the same as the CUME_DIST function in SQL.
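
A minimal plain-Python sketch of this definition, assuming an ascending ordering (the helper name cume_dist_values is
hypothetical and this is not Spark's implementation):

```python
def cume_dist_values(values):
    """Fraction of values in the partition that are <= each value."""
    n = len(values)
    return [sum(1 for other in values if other <= value) / n
            for value in values]

sales = [3000, 3000, 4100, 4100, 4600]
print(cume_dist_values(sales))  # [0.4, 0.4, 0.8, 0.8, 1.0]
```

Tied values receive the same result, and the smallest value gets 2/5 here rather than 0 because two of the five rows are less
than or equal to it.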

The lag() function gives access to an earlier row’s value within the partition, based on a specified offset (1 by default).
It returns the column value from the row that many positions before the current row, or null if there is no such row. This is
helpful for comparative analysis or for calculating differences between consecutive rows.

The lead() function is the mirror image: it returns the column value from the row a specified offset after the current row
within the partition, or null if there is no such row. It helps in comparing a row with the rows that follow it.
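
The offset behaviour of both functions can be sketched in plain Python. The helper names lag_values and lead_values are
hypothetical, None stands in for Spark's null, and the sketch assumes a non-negative offset.

```python
def lag_values(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` if out of range."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead_values(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` if out of range."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

sales = [3000, 3000, 4100, 4100, 4600]  # one partition, already ordered
print(lag_values(sales, 2))   # [None, None, 3000, 3000, 4100]
print(lead_values(sales, 2))  # [4100, 4100, 4600, None, None]
```

With an offset of 2, the first two rows of a partition have no lag value and the last two have no lead value.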

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Sample rows: (employee_name, department, salary). The duplicate James row is intentional; it produces a tie.
simpleData = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]
 
columns= ["employee_name", "department", "salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec  = Window.partitionBy("department").orderBy("salary")

df.withColumn("row_number",row_number().over(windowSpec)) \
    .show(truncate=False)

from pyspark.sql.functions import rank
df.withColumn("rank",rank().over(windowSpec)) \
    .show()

from pyspark.sql.functions import dense_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec)) \
    .show()

from pyspark.sql.functions import percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec)) \
    .show()

from pyspark.sql.functions import ntile
df.withColumn("ntile",ntile(2).over(windowSpec)) \
    .show()

#Window Analytic Functions
from pyspark.sql.functions import cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec)) \
   .show()

from pyspark.sql.functions import lag    
df.withColumn("lag",lag("salary",2).over(windowSpec)) \
      .show()

from pyspark.sql.functions import lead    
df.withColumn("lead",lead("salary",2).over(windowSpec)) \
    .show()
    
#Window Aggregate Functions
windowSpecAgg  = Window.partitionBy("department")
# Import under an alias so that Python's built-in sum, min and max are not shadowed.
from pyspark.sql import functions as F
df.withColumn("row", F.row_number().over(windowSpec)) \
  .withColumn("avg", F.avg(F.col("salary")).over(windowSpecAgg)) \
  .withColumn("sum", F.sum(F.col("salary")).over(windowSpecAgg)) \
  .withColumn("min", F.min(F.col("salary")).over(windowSpecAgg)) \
  .withColumn("max", F.max(F.col("salary")).over(windowSpecAgg)) \
  .where(F.col("row") == 1).select("department","avg","sum","min","max") \
  .show()

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+

+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
|Maria        |Finance   |3000  |1         |
|Scott        |Finance   |3300  |2         |
|Jen          |Finance   |3900  |3         |
|Kumar        |Marketing |2000  |1         |
|Jeff         |Marketing |3000  |2         |
|James        |Sales     |3000  |1