# 2 Window Function Performance Issues

In the previous section, we explained that a window specification changes the partitioning of the data frame. As a result, the number of partitions of your data frame may be far from optimal. For example, if your window specification partitions by a column that has only one unique value, the entire dataset is shuffled to a single executor. If that executor's memory and disk cannot hold the data, the job fails with OOM errors.
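To make the skew concrete, here is a minimal pure-Python sketch of hash partitioning. It is only an illustration: the helper `hash_partition` is hypothetical, and Python's built-in `hash` stands in for the hash function that Spark actually uses.

```python
from collections import Counter


def hash_partition(keys, num_partitions):
    """Assign each key to a partition id, the way a hash partitioner would (simplified)."""
    return [hash(key) % num_partitions for key in keys]


# 13 rows that all share the same partitioning key, like the name column of the sample data
names = ["Alex"] * 13

# Count how many rows each partition receives
rows_per_partition = Counter(hash_partition(names, num_partitions=4))

# Identical keys always hash to the same partition id, so one partition holds all 13 rows
print(rows_per_partition)
```

Adding more partitions does not help here: the number of distinct keys, not the number of partitions, limits how far the rows can be spread.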

If we use another column in partitionBy, the global order of the column that we want to sort is lost.
So we need to partition the dataframe while conserving the global order of the sort column, which means all_values_in_sort_column(partition_0) < all_values_in_sort_column(partition_1) < all_values_in_sort_column(partition_2). If the values inside each partition are also sorted, the global order of the sort column is kept.
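The condition above can be written as a small check. This is a pure-Python sketch, not Spark code: partitions are modelled as plain lists, and the function name `preserves_global_order` is made up for the illustration.

```python
def preserves_global_order(partitions):
    """Return True if every partition is sorted and every value of partition i
    is strictly smaller than every value of partition i + 1."""
    previous_max = None
    for part in partitions:
        if not part:
            continue  # empty partitions do not affect the order
        if list(part) != sorted(part):
            return False  # values inside the partition are not sorted
        if previous_max is not None and part[0] <= previous_max:
            return False  # this partition overlaps the previous one
        previous_max = part[-1]
    return True


# Prices split by value range: each partition covers its own range
by_price_range = [[5, 5, 5], [10], [15], [20, 20, 20, 20], [30, 30, 30, 30]]

# Prices split by some unrelated column: the ranges overlap
by_other_column = [[5, 20, 30], [10, 15, 20]]

print(preserves_global_order(by_price_range))   # True
print(preserves_global_order(by_other_column))  # False
```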

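Note: in Spark 3.1.2, rows with the same value of the sort column are always put in the same partition. On a small dataset such as the one below, each distinct value ends up in its own partition: if column price has 12 distinct values, the rows of the dataframe after orderBy("price") are spread over 12 partitions.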
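Equal values end up in the same partition because orderBy partitions rows by value range. The sketch below imitates that idea in pure Python. The boundary values in `bounds` are hypothetical (Spark derives its own boundaries by sampling the data), and `range_partition` is an illustrative helper, not a Spark API.

```python
from bisect import bisect_left


def range_partition(value, upper_bounds):
    """Return the index of the first upper bound that is >= value.
    Values larger than every bound go to one extra, last partition."""
    return bisect_left(upper_bounds, value)


# Hypothetical upper bounds of the first four partitions
bounds = [5, 10, 15, 20]

# The price column of the sample data, in its original row order
prices = [5, 10, 15, 20, 20, 20, 20, 30, 30, 30, 30, 5, 5]

# Group the prices by the partition id they are assigned to
assignment = {}
for price in prices:
    assignment.setdefault(range_partition(price, bounds), []).append(price)

for partition_id in sorted(assignment):
    print(partition_id, assignment[partition_id])
```

Because the partition id depends only on the value, two rows with the same price can never be separated, and partition ids increase with the price.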

In [27]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import spark_partition_id, row_number, rank, dense_rank, col, lit, coalesce, broadcast, max, sum
from pyspark.sql.window import Window
import os

In [2]:
local=False

if local:
    spark=SparkSession.builder.master("local[4]").appName("WindowFunctionPerformenceIssus").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("WindowFunctionPerformenceIssus") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1') \
                      .getOrCreate()
     

In [3]:
data = [('Alex', '2018-10-10', 'Paint', 5),
            ('Alex', '2018-04-02', 'Ladder', 10),
            ('Alex', '2018-06-22', 'Stool', 15),
            ('Alex', '2018-12-09', 'Vacuum', 20),
            ('Alex', '2018-07-12', 'Bucket', 20),
            ('Alex', '2018-02-18', 'Gloves', 20),
            ('Alex', '2018-03-03', 'Brushes', 20),
            ('Alex', '2018-09-26', 'Sandpaper', 30),
            ('Alex', '2018-12-09', 'Vacuum', 30),
            ('Alex', '2018-07-12', 'Bucket', 30),
            ('Alex', '2018-02-18', 'Gloves', 30),
            ('Alex', '2018-03-03', 'Brushes', 5),
            ('Alex', '2018-09-26', 'Sandpaper', 5)]

df = spark.createDataFrame(data,schema=['name','date','product','price'])
df.show()
df.printSchema()

+----+----------+---------+-----+
|name|      date|  product|price|
+----+----------+---------+-----+
|Alex|2018-10-10|    Paint|    5|
|Alex|2018-04-02|   Ladder|   10|
|Alex|2018-06-22|    Stool|   15|
|Alex|2018-12-09|   Vacuum|   20|
|Alex|2018-07-12|   Bucket|   20|
|Alex|2018-02-18|   Gloves|   20|
|Alex|2018-03-03|  Brushes|   20|
|Alex|2018-09-26|Sandpaper|   30|
|Alex|2018-12-09|   Vacuum|   30|
|Alex|2018-07-12|   Bucket|   30|
|Alex|2018-02-18|   Gloves|   30|
|Alex|2018-03-03|  Brushes|    5|
|Alex|2018-09-26|Sandpaper|    5|
+----+----------+---------+-----+

root
 |-- name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- product: string (nullable = true)
 |-- price: long (nullable = true)



In [10]:
# Check the default partitioning of the dataframe.
# Notice that we have 4 partitions and the rows are split almost evenly across them.
df.withColumn("partition_id",spark_partition_id()).show(truncate=False)

+----+----------+---------+-----+------------+
|name|date      |product  |price|partition_id|
+----+----------+---------+-----+------------+
|Alex|2018-10-10|Paint    |5    |0           |
|Alex|2018-04-02|Ladder   |10   |0           |
|Alex|2018-06-22|Stool    |15   |0           |
|Alex|2018-12-09|Vacuum   |20   |1           |
|Alex|2018-07-12|Bucket   |20   |1           |
|Alex|2018-02-18|Gloves   |20   |1           |
|Alex|2018-03-03|Brushes  |20   |2           |
|Alex|2018-09-26|Sandpaper|30   |2           |
|Alex|2018-12-09|Vacuum   |30   |2           |
|Alex|2018-07-12|Bucket   |30   |3           |
|Alex|2018-02-18|Gloves   |30   |3           |
|Alex|2018-03-03|Brushes  |5    |3           |
|Alex|2018-09-26|Sandpaper|5    |3           |
+----+----------+---------+-----+------------+



# 2.1 Example of what goes wrong with a window spec without partitionBy

In [28]:
# Now let's try a window function.
# We first define a window specification with orderBy but without partitionBy.
# Notice that all rows end up in partition 0, which means they are shuffled to the same executor.
# If millions of rows land on the same executor, you will get an OOM error.

window_with_only_order= Window.orderBy("price")
df1=df.withColumn("row_number",row_number().over(window_with_only_order))\
      .withColumn("rank",dense_rank().over(window_with_only_order)) \
      .withColumn("partition_id",spark_partition_id())
# Note that the entire dataframe has only 1 partition, because our window spec has no partitionBy.
df1.show()

+----+----------+---------+-----+----------+----+------------+
|name|      date|  product|price|row_number|rank|partition_id|
+----+----------+---------+-----+----------+----+------------+
|Alex|2018-10-10|    Paint|    5|         1|   1|           0|
|Alex|2018-03-03|  Brushes|    5|         2|   1|           0|
|Alex|2018-09-26|Sandpaper|    5|         3|   1|           0|
|Alex|2018-04-02|   Ladder|   10|         4|   2|           0|
|Alex|2018-06-22|    Stool|   15|         5|   3|           0|
|Alex|2018-03-03|  Brushes|   20|         6|   4|           0|
|Alex|2018-12-09|   Vacuum|   20|         7|   4|           0|
|Alex|2018-07-12|   Bucket|   20|         8|   4|           0|
|Alex|2018-02-18|   Gloves|   20|         9|   4|           0|
|Alex|2018-09-26|Sandpaper|   30|        10|   5|           0|
|Alex|2018-12-09|   Vacuum|   30|        11|   5|           0|
|Alex|2018-07-12|   Bucket|   30|        12|   5|           0|
|Alex|2018-02-18|   Gloves|   30|        13|   5|           0|
+----+----------+---------+-----+----------+----+------------+

# 2.2 Solve the above problem

Now we want to calculate the rank of the price column over the whole dataset, without splitting the rows into separate windows. On a small dataset we can directly use an orderBy window specification without partitionBy, as in section 2.1.
But on a big dataset, we will get an OOM error.

To avoid this, we do the following:

Step 1: Use orderBy("price") to partition the dataframe, and create a column "partition_id" with the function spark_partition_id.

Step 2: Create a local_rank column that ranks each row inside its partition.

Step 3: Get the max rank of each partition by grouping on the partition_id column.

Step 4: Use a window spec to get the cumulative rank number for each partition.

Step 5: Calculate a sum factor that can be added to the local rank to turn it into a global rank.

Step 6: Join the sum_factor with the local rank and calculate the global rank.
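Before going through the Spark code, the six steps can be sketched in pure Python. This is an illustration under stated assumptions, not the Spark implementation: partitions are plain lists that are already split by value range (Step 1), the local rank is a dense rank so that the result matches dense_rank() over the whole dataset, and the function name `global_dense_rank` is made up for the example.

```python
from itertools import accumulate


def global_dense_rank(partitions):
    """Compute a global dense rank from range-partitioned lists of values.
    Returns a list of (value, partition_id, global_rank) tuples."""
    # Step 2: dense rank of each value inside its own partition
    local = []
    for part in partitions:
        rank_of = {v: i + 1 for i, v in enumerate(sorted(set(part)))}
        local.append([(v, rank_of[v]) for v in sorted(part)])

    # Step 3: maximum local rank of each partition
    max_rank = [max((r for _, r in rows), default=0) for rows in local]

    # Step 4: cumulative sum of the maximum ranks
    cum_rank = list(accumulate(max_rank))

    # Step 5: shift by one partition, so partition 0 gets a sum factor of 0
    sum_factor = [0] + cum_rank[:-1]

    # Step 6: global rank = local rank + sum factor of the partition
    return [
        (value, pid, rank + sum_factor[pid])
        for pid, rows in enumerate(local)
        for value, rank in rows
    ]


# Step 1 is assumed done: two partitions with disjoint, increasing price ranges
result = global_dense_rank([[5, 5, 10], [15, 20, 20]])
for row in result:
    print(row)
```

Only the small per-partition summaries (max_rank, cum_rank, sum_factor) are combined globally, so no single worker has to see every row.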

In [16]:
# Step 1:
# To partition the dataframe while conserving the global sort order, we can use orderBy("sort_column_name").
# In our case, the column we want to sort is price.
df_sort_part = df.orderBy("price").withColumn("partition_id", spark_partition_id())
df_sort_part.show(truncate=False)
df_sort_part.rdd.getNumPartitions()

+----+----------+---------+-----+------------+
|name|date      |product  |price|partition_id|
+----+----------+---------+-----+------------+
|Alex|2018-10-10|Paint    |5    |0           |
|Alex|2018-03-03|Brushes  |5    |0           |
|Alex|2018-09-26|Sandpaper|5    |0           |
|Alex|2018-04-02|Ladder   |10   |1           |
|Alex|2018-06-22|Stool    |15   |2           |
|Alex|2018-03-03|Brushes  |20   |3           |
|Alex|2018-12-09|Vacuum   |20   |3           |
|Alex|2018-07-12|Bucket   |20   |3           |
|Alex|2018-02-18|Gloves   |20   |3           |
|Alex|2018-07-12|Bucket   |30   |4           |
|Alex|2018-02-18|Gloves   |30   |4           |
|Alex|2018-09-26|Sandpaper|30   |4           |
|Alex|2018-12-09|Vacuum   |30   |4           |
+----+----------+---------+-----+------------+



6

In [20]:
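# Step 2: Create a local_rank column to rank each row inside its partition.
# We create a window specification partitioned by the partition_id column that orderBy() produced.
win_part_id = Window.partitionBy("partition_id").orderBy("price")
# Caveat: in cluster mode the window function triggers a new shuffle on partition_id,
# so the result is repartitioned into 200 partitions (the default of spark.sql.shuffle.partitions).
# dense_rank() is used so that max(local_rank) equals the number of distinct prices in the partition;
# with rank(), ties would leave gaps and the sum factor of Step 5 would be wrong.
df_rank = df_sort_part.withColumn("local_rank", dense_rank().over(win_part_id))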
df_rank.show()
df_rank.rdd.getNumPartitions()

+----+----------+---------+-----+------------+----------+
|name|      date|  product|price|partition_id|local_rank|
+----+----------+---------+-----+------------+----------+
|Alex|2018-04-02|   Ladder|   10|           1|         1|
|Alex|2018-03-03|  Brushes|   20|           3|         1|
|Alex|2018-12-09|   Vacuum|   20|           3|         1|
|Alex|2018-07-12|   Bucket|   20|           3|         1|
|Alex|2018-02-18|   Gloves|   20|           3|         1|
|Alex|2018-07-12|   Bucket|   30|           4|         1|
|Alex|2018-02-18|   Gloves|   30|           4|         1|
|Alex|2018-09-26|Sandpaper|   30|           4|         1|
|Alex|2018-12-09|   Vacuum|   30|           4|         1|
|Alex|2018-06-22|    Stool|   15|           2|         1|
|Alex|2018-03-03|  Brushes|    5|           0|         1|
|Alex|2018-09-26|Sandpaper|    5|           0|         1|
|Alex|2018-10-10|    Paint|    5|           0|         1|
+----+----------+---------+-----+------------+----------+



200
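In the output above every partition holds a single distinct price, so local_rank is 1 everywhere and the choice of ranking function is invisible. It becomes visible as soon as a partition holds several distinct values, because the later steps add up the maximum local rank of each partition. The pure-Python sketch below imitates the SQL semantics of rank and dense_rank on one hypothetical partition. The helpers are illustrative, not Spark functions.

```python
def local_rank(values):
    """SQL-style rank: tied values share a rank and the next rank skips ahead."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in ordered]


def local_dense_rank(values):
    """SQL-style dense_rank: tied values share a rank and no rank is skipped."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in sorted(values)]


# A hypothetical partition with a tie on the smallest price
partition = [5, 5, 10]

print(local_rank(partition))        # [1, 1, 3]
print(local_dense_rank(partition))  # [1, 1, 2]
```

The maximum dense rank (2) equals the number of distinct prices in the partition, which is the offset the next partition needs for a global dense rank. The maximum rank (3) is neither that number nor, in general, the row count, so it does not give a consistent offset.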

In [13]:
# Step 3: Get max rank of each partition by grouping the partition_id column
df_tmp = df_rank.groupBy("partition_id").agg(max("local_rank").alias("max_rank"))

df_tmp.show()

+------------+--------+
|partition_id|max_rank|
+------------+--------+
|           1|       1|
|           3|       1|
|           4|       1|
|           2|       1|
|           0|       1|
+------------+--------+



In [21]:
# Step 4: Use a window spec to get cumulative rank number for each partition
win_rank = Window.orderBy("partition_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_stats = df_tmp.withColumn("cum_rank", sum("max_rank").over(win_rank))
df_stats.show()

+------------+--------+--------+
|partition_id|max_rank|cum_rank|
+------------+--------+--------+
|           0|       1|       1|
|           1|       1|       2|
|           2|       1|       3|
|           3|       1|       4|
|           4|       1|       5|
+------------+--------+--------+



In [24]:
# Step 5: Calculate a sum factor that is added to the local rank to obtain the global rank.
# tmp1 is a self join with the join condition l.partition_id == r.partition_id + 1,
# which shifts the cumulative sum of the right data frame down by one row.
tmp1 = df_stats.alias("l").join(df_stats.alias("r"), col("l.partition_id") == col("r.partition_id") + 1, "left")
tmp1.show()

join_df = tmp1.select(col("l.partition_id"), coalesce(col("r.cum_rank"), lit(0)).alias("sum_factor"))
join_df.show()

+------------+--------+--------+------------+--------+--------+
|partition_id|max_rank|cum_rank|partition_id|max_rank|cum_rank|
+------------+--------+--------+------------+--------+--------+
|           0|       1|       1|        null|    null|    null|
|           1|       1|       2|           0|       1|       1|
|           2|       1|       3|           1|       1|       2|
|           3|       1|       4|           2|       1|       3|
|           4|       1|       5|           3|       1|       4|
+------------+--------+--------+------------+--------+--------+

+------------+----------+
|partition_id|sum_factor|
+------------+----------+
|           1|         1|
|           3|         3|
|           4|         4|
|           2|         2|
|           0|         0|
+------------+----------+



In [32]:
# Step 6: Join the sum_factor with the local rank and calculate the global rank
df_final = df_rank.join(broadcast(join_df), "partition_id", "inner") \
        .withColumn("rank", col("local_rank") + col("sum_factor"))

df_final.orderBy("rank").show()
df_final.rdd.getNumPartitions()

+------------+----+----------+---------+-----+----------+----------+----+
|partition_id|name|      date|  product|price|local_rank|sum_factor|rank|
+------------+----+----------+---------+-----+----------+----------+----+
|           0|Alex|2018-10-10|    Paint|    5|         1|         0|   1|
|           0|Alex|2018-03-03|  Brushes|    5|         1|         0|   1|
|           0|Alex|2018-09-26|Sandpaper|    5|         1|         0|   1|
|           1|Alex|2018-04-02|   Ladder|   10|         1|         1|   2|
|           2|Alex|2018-06-22|    Stool|   15|         1|         2|   3|
|           3|Alex|2018-03-03|  Brushes|   20|         1|         3|   4|
|           3|Alex|2018-07-12|   Bucket|   20|         1|         3|   4|
|           3|Alex|2018-12-09|   Vacuum|   20|         1|         3|   4|
|           3|Alex|2018-02-18|   Gloves|   20|         1|         3|   4|
|           4|Alex|2018-07-12|   Bucket|   30|         1|         4|   5|
|           4|Alex|2018-09-26|Sandpape

200

In [33]:
# Notice that the dataframe above has 200 partitions, while df1 has only one.
df1.show()
df1.rdd.getNumPartitions()

+----+----------+---------+-----+----------+----+------------+
|name|      date|  product|price|row_number|rank|partition_id|
+----+----------+---------+-----+----------+----+------------+
|Alex|2018-10-10|    Paint|    5|         1|   1|           0|
|Alex|2018-03-03|  Brushes|    5|         2|   1|           0|
|Alex|2018-09-26|Sandpaper|    5|         3|   1|           0|
|Alex|2018-04-02|   Ladder|   10|         4|   2|           0|
|Alex|2018-06-22|    Stool|   15|         5|   3|           0|
|Alex|2018-03-03|  Brushes|   20|         6|   4|           0|
|Alex|2018-12-09|   Vacuum|   20|         7|   4|           0|
|Alex|2018-07-12|   Bucket|   20|         8|   4|           0|
|Alex|2018-02-18|   Gloves|   20|         9|   4|           0|
|Alex|2018-07-12|   Bucket|   30|        10|   5|           0|
|Alex|2018-02-18|   Gloves|   30|        11|   5|           0|
|Alex|2018-09-26|Sandpaper|   30|        12|   5|           0|
|Alex|2018-12-09|   Vacuum|   30|        13|   5|           0|
+----+----------+---------+-----+----------+----+------------+

1