In [1]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

SHUFFLE_PARTITIONS = 5

In [2]:
spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("salaries")
         .config("spark.sql.adaptive.enabled", "false")
         .config("spark.sql.shuffle.partitions", SHUFFLE_PARTITIONS)
         .getOrCreate()
         )

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/15 03:49:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/15 03:49:50 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Reading employee and salary data.

In [3]:
df_employees = (spark
                .read
                .option("header", True)
                .option("delimiter", ";")
                .option("inferSchema", True)
                .csv("inputs/employees_10000.csv")
                )

df_salaries = (spark
               .read
               .option("header", True)
               .option("delimiter", ";")
               .option("inferSchema", True)
               .csv("inputs/salaries_10000.csv")
               )

### Demonstration of poor distribution of data between partitions

Join the employee and salary data on `salary_id` and display the result.

In [4]:
df_unbalanced = (df_employees
                 .join(df_salaries, on="salary_id", how="inner")
                 .select("id", "salary_id", "name", "department", "salary"))

df_unbalanced.show()

+---+---------+--------------------+-----------+--------+
| id|salary_id|                name| department|  salary|
+---+---------+--------------------+-----------+--------+
|  1|        1|    Frederico Santos|  Mercearia|23708.05|
|  2|        2|Alessandra Noguei...|     Livros|21482.94|
|  3|        3|  Dalila Xavier Neto|     Música|15480.98|
|  4|        4|           Davi Reis|     Jardim|11900.67|
|  5|        5|     Warley Reis Jr.|     Música| 6097.59|
|  6|        6|Srta. Maria Luiza...|Eletrônicos| 6501.28|
|  7|        7|     Ricardo Batista|     Beleza| 5151.14|
|  8|        8|Srta. Melissa Xavier|   Crianças| 2360.42|
|  9|        9|         Kléber Reis|   Crianças| 7841.75|
| 10|       10|         Silas Silva|    Sapatos|17572.83|
| 11|       11|       Melissa Souza|     Roupas|25391.04|
| 12|       12|        João Batista|      Jogos|  7907.0|
| 13|       13|       Roberto Souza|     Jardim|21442.78|
| 14|       14|      Helena Saraiva|      Jogos| 6897.37|
| 15|       15

The partition counts show the imbalance: all 10,000 rows sit in a single partition, even though `spark.sql.shuffle.partitions` is set to 5.

In [5]:
(df_unbalanced
 .withColumn("partition_id", F.spark_partition_id())
 .groupBy("partition_id")
 .count()
 .show()
 )

+------------+-----+
|partition_id|count|
+------------+-----+
|           0|10000|
+------------+-----+



### Demonstration of the effects of applying the salt hash join technique

We create a new DataFrame from `df_employees` by adding a `salt_id` column: the row's `salary_id` concatenated with a random integer between 0 and 4.

Next, the DataFrame is repartitioned on the `salt_id` column. Because the salt is random, the rows are spread more evenly across partitions.

In [6]:
# Salt: "<salary_id>_<random integer in [0, SHUFFLE_PARTITIONS)>"
salt_col = F.concat_ws("_", F.col("salary_id"), (F.rand() * SHUFFLE_PARTITIONS).cast("integer"))

df_employees_balanced = (df_employees
                         .withColumn("salt_id", salt_col)
                         .repartition("salt_id")
                         )
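
The effect of the salt is easiest to see on a skewed key, which the sample data here does not have. The following is a minimal pure-Python sketch, not Spark code: `toy_partition` is a hypothetical stand-in for Spark's hash partitioner, and the 1,000 rows sharing the key `"42"` are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 5  # plays the same role as SHUFFLE_PARTITIONS in the notebook


def toy_partition(key: str, num_partitions: int = NUM_SALTS) -> int:
    # Toy deterministic partitioner for illustration only;
    # Spark uses its own hash function, not a byte sum.
    return sum(key.encode("utf-8")) % num_partitions


def add_salt(key: str, rng: random.Random, num_salts: int = NUM_SALTS) -> str:
    # Same shape as the notebook's salt_id: "<key>_<random int in [0, num_salts)>"
    return f"{key}_{rng.randrange(num_salts)}"


rng = random.Random(0)
hot_keys = ["42"] * 1000  # hypothetical skew: every row shares one join key

# Without a salt, identical keys always hash to the same partition.
unsalted = Counter(toy_partition(k) for k in hot_keys)

# With a salt, the same key is split into up to NUM_SALTS distinct keys.
salted = Counter(toy_partition(add_salt(k, rng)) for k in hot_keys)

print("unsalted:", dict(unsalted))
print("salted:  ", dict(sorted(salted.items())))
```

In this sketch the unsalted rows all fall into one partition, while the salted rows are divided among five.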

We create a new DataFrame from `df_salaries`, also adding the `salt_id` column. This time it is built from the cartesian product of the values 0 to 4 with the `salary_id` column of `df_salaries`, which makes the DataFrame 5 times larger.

As before, the DataFrame is repartitioned on `salt_id`.

In [7]:
df_range = spark.range(0, SHUFFLE_PARTITIONS)

df_salaries_balanced = (df_salaries.alias("s")
                        .join(df_range.alias("r"), how="cross")
                        .withColumn("salt_id", F.concat_ws("_", F.col("salary_id"), F.col("r.id")))
                        .repartition("salt_id")
                        .drop(F.col("r.id"))
                        )
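
The replication is what keeps the join correct: whichever salt an employee row draws, a salary row with the same salted key exists. Here is a small pure-Python sketch of that idea, using hypothetical miniature tables rather than the notebook's CSV data.

```python
import random

NUM_SALTS = 5

# Hypothetical miniature tables (not the notebook's data).
employees = [(1, "Ana", 10), (2, "Bruno", 10), (3, "Carla", 20)]  # (id, name, salary_id)
salaries = [(10, 5000.0), (20, 7000.0)]                            # (salary_id, salary)

rng = random.Random(1)

# Left side: one random salt per row.
employees_salted = [
    (emp_id, name, f"{sid}_{rng.randrange(NUM_SALTS)}")
    for emp_id, name, sid in employees
]

# Right side: every row replicated once per salt value (the cross join with the range).
salaries_salted = {
    f"{sid}_{salt}": salary
    for sid, salary in salaries
    for salt in range(NUM_SALTS)
}

# Plain join on salary_id versus salted join on salt_id.
salary_by_id = dict(salaries)
plain = sorted((emp_id, name, salary_by_id[sid]) for emp_id, name, sid in employees)
salted = sorted(
    (emp_id, name, salaries_salted[salt_id])
    for emp_id, name, salt_id in employees_salted
)

assert plain == salted  # both joins produce the same rows
print(len(salaries), "->", len(salaries_salted), "salary rows after replication")
# prints: 2 -> 10 salary rows after replication
```

The cost is visible in the last line: the replicated side grows by a factor equal to the number of salt values.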

Join the two DataFrames on the `salt_id` column and display the result.

In [8]:
df_balanced = (df_employees_balanced.alias("e")
               .join(df_salaries_balanced.alias("s"), on="salt_id", how="inner")
               .select("id", "s.salary_id", "name", "department", "salary", "salt_id")
               )

df_balanced.show()

+---+---------+--------------------+------------+--------+-------+
| id|salary_id|                name|  department|  salary|salt_id|
+---+---------+--------------------+------------+--------+-------+
|  5|        5|     Warley Reis Jr.|      Música| 6097.59|    5_0|
|  6|        6|Srta. Maria Luiza...| Eletrônicos| 6501.28|    6_2|
| 34|       34|   Fabiano Melo Neto|  Automotivo|17133.47|   34_4|
| 52|       52|       Talita Barros|       Jóias|23644.01|   52_2|
| 56|       56|        Lorena Braga|       Jóias| 9388.97|   56_2|
| 58|       58|          Lara Souza|  Industrial|10079.65|   58_2|
| 61|       61|        Bruna Santos|   Mercearia|23166.84|   61_3|
| 63|       63|        Heloísa Melo|        Casa| 1589.07|   63_1|
| 67|       67|     Yasmin Oliveira|       Jóias|18651.68|   67_1|
| 75|       75|      Joana Nogueira|Computadores|19082.81|   75_3|
| 83|       83|           Davi Melo|  Automotivo|20799.28|   83_4|
| 84|       84|         Félix Silva|    Esportes| 6872.22|   8

The rows are now spread almost evenly across the 5 partitions defined in `spark.sql.shuffle.partitions`.

In [9]:
(df_balanced
 .withColumn("partition_id", F.spark_partition_id())
 .groupBy("partition_id")
 .agg(F.count("*").alias("count"))
 .orderBy("partition_id")
 .show()
 )

+------------+-----+
|partition_id|count|
+------------+-----+
|           0| 1993|
|           1| 1932|
|           2| 1991|
|           3| 1996|
|           4| 2088|
+------------+-----+
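
One way to put a number on the balance is the ratio of the largest partition to the mean partition size. The sketch below applies it to the counts shown in the two outputs; `skew_ratio` is a hypothetical helper, and padding the unbalanced case with four empty partitions is an assumption made only so that both cases are compared over 5 slots.

```python
def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Counts from the unbalanced join; the four zeros are an assumption (see above).
before = [10000, 0, 0, 0, 0]
# Counts from the salted join, as printed by the notebook.
after = [1993, 1932, 1991, 1996, 2088]

print(skew_ratio(before))           # 5.0
print(round(skew_ratio(after), 3))  # 1.044
```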



For small datasets the performance difference is barely noticeable, and salting may even be slower because it adds a cross join and extra repartitioning. On large datasets, especially in a clustered environment, this technique can bring large performance gains and help avoid out-of-memory errors, because the shuffle spreads the rows more evenly across tasks.

With the arrival of [AQE](https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution) (Adaptive Query Execution), strategies like this are unnecessary most of the time, but it is still worth keeping alternative ways of solving this type of problem in mind.
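
For comparison, the session in this notebook sets `spark.sql.adaptive.enabled` to `"false"`. A minimal sketch of the opposite setup follows; the two property names come from Spark's SQL performance tuning guide, while `AQE_SKEW_CONF` and `apply_conf` are hypothetical names introduced here for illustration.

```python
# Properties that turn on AQE and its skew-join handling.
AQE_SKEW_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_conf(builder, conf):
    """Apply each key/value pair to a builder that exposes .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage with PySpark (assumes pyspark is installed):
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder.master("local[*]"), AQE_SKEW_CONF).getOrCreate()
```

Whether AQE actually splits a skewed partition depends on the join strategy and on size thresholds, so the partition counts are still worth checking after enabling it.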