# Partition

https://kontext.tech/column/spark/296/data-partitioning-in-spark-pyspark-in-depth-walkthrough  
https://kontext.tech/column/spark/299/data-partitioning-functions-in-spark-pyspark-explained  
https://mungingdata.com/apache-spark/partitionby/   
https://stackoverflow.com/questions/65809909/spark-what-is-the-difference-between-repartition-and-repartitionbyrange   
https://www.robinlinacre.com/spark_sort/    
https://stackoverflow.com/questions/32887595/how-does-spark-achieve-sort-order/32888236#32888236

In [35]:
from pyspark.sql import SparkSession

In [36]:
MAX_NUM_CORES = 10

In [37]:
spark = SparkSession.builder \
    .master("spark://IMCHLT276:7077") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", f"{MAX_NUM_CORES}") \
    .config("spark.local.dir", "/opt/tmp/spark-temp/") \
    .appName("DataSkewness") \
    .getOrCreate()

sc = spark.sparkContext

In [38]:
spark

-----------------------

**Test how the number of partitions affects the number of output files**

**Test 1** : Is the default number of partitions equal to the number of cores?

In [39]:
df = spark.range(100000)
df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
only showing top 20 rows



In [40]:
df.rdd.getNumPartitions() == MAX_NUM_CORES

False

In [41]:
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/

_SUCCESS
part-00000-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00001-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00002-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00003-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00004-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00005-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00006-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet
part-00007-d22ab354-d3f7-4349-a366-86d0c55b321a-c000.snappy.parquet


--------------------------
**Test 2** : Repartitioning changes the number of output files

In [42]:
df = spark.range(100000)
df = df.repartition(20)
df.rdd.getNumPartitions()

20

In [43]:
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/

_SUCCESS
part-00000-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00001-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00002-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00003-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00004-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00005-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00006-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00007-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00008-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00009-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00010-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00011-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00012-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00013-d74c0121-97ea-4697-ba0a-d5bdd1e3bb13-c000.snappy.parquet
part-00014-d74c0121-97ea-4697-ba0a-d5bd

------------------
**Test 3** : Repartition to 1 and see what happens?

In [44]:
df = spark.range(10000000)
df = df.repartition(1)
df.rdd.getNumPartitions()

1

In [45]:
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/ -alh

total 39M
drwxrwxr-x  2 mageswarand mageswarand 4.0K May  5 22:10 .
drwxrwxrwt 37 root        root        444K May  5 22:10 ..
-rw-r--r--  1 mageswarand mageswarand    8 May  5 22:10 ._SUCCESS.crc
-rw-r--r--  1 mageswarand mageswarand 306K May  5 22:10 .part-00000-8790ad36-b9eb-4e52-8373-ea0546c8221a-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand    0 May  5 22:10 _SUCCESS
-rw-r--r--  1 mageswarand mageswarand  39M May  5 22:10 part-00000-8790ad36-b9eb-4e52-8373-ea0546c8221a-c000.snappy.parquet


------------------------
**Test 4** : coalesce

In [46]:
df = spark.range(100000)
df = df.coalesce(1)
df.rdd.getNumPartitions()

1

In [47]:
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/ -alh

total 852K
drwxrwxr-x  2 mageswarand mageswarand 4.0K May  5 22:10 .
drwxrwxrwt 37 root        root        444K May  5 22:10 ..
-rw-r--r--  1 mageswarand mageswarand    8 May  5 22:10 ._SUCCESS.crc
-rw-r--r--  1 mageswarand mageswarand 3.1K May  5 22:10 .part-00000-a7d37985-4487-414b-935a-fe771154d105-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand    0 May  5 22:10 _SUCCESS
-rw-r--r--  1 mageswarand mageswarand 392K May  5 22:10 part-00000-a7d37985-4487-414b-935a-fe771154d105-c000.snappy.parquet


------------------
**Test 5** : Add a text column, repartition to 1, and see what happens. Size on local disk doesn't matter. On HDFS this may change.

In [48]:
import string, random
import pyspark.sql.functions as F
from pyspark.sql.types import *

letters = string.ascii_lowercase
letters_upper = string.ascii_uppercase

for _i in range(0, 10):
    letters += letters

for _i in range(0, 10):
    letters += letters_upper

print("Number of chars to choose from", len(letters))
sample_string = random.sample(letters, 500)
# print("sample_string", ''.join(sample_string))

def random_string(stringLength=200):
    """Generate a random string of fixed length """
    return ''.join(random.sample(letters, stringLength))

random_string_udf = F.udf(random_string,StringType())

Number of chars to choose from 26884


In [49]:
df = spark.range(1000000)
df = df.withColumn("data", random_string_udf())

In [50]:
df = df.repartition(1, F.col("data"))
df = df.select("data")

In [51]:
df.rdd.getNumPartitions()

1

In [52]:
%time
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/ -alh

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 3.81 µs
total 197M
drwxrwxr-x  2 mageswarand mageswarand 4.0K May  5 22:11 .
drwxrwxrwt 37 root        root        444K May  5 22:10 ..
-rw-r--r--  1 mageswarand mageswarand    8 May  5 22:11 ._SUCCESS.crc
-rw-r--r--  1 mageswarand mageswarand 1.6M May  5 22:11 .part-00000-0f2da5c5-cc92-4a3f-8c49-2dc2512b8342-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand    0 May  5 22:11 _SUCCESS
-rw-r--r--  1 mageswarand mageswarand 195M May  5 22:11 part-00000-0f2da5c5-cc92-4a3f-8c49-2dc2512b8342-c000.snappy.parquet


------------------------

In [53]:
from pyspark.sql.functions import spark_partition_id

df.groupBy(spark_partition_id()).count().show()

+--------------------+-------+
|SPARK_PARTITION_ID()|  count|
+--------------------+-------+
|                   0|1000000|
+--------------------+-------+



-----------------------------
**Test 6** : Read back the DataFrame that was stored with 1 partition and check how many partitions there are. It equals the number of cores.

In [54]:
df = spark.read.parquet("/tmp/df_tes/")
df.rdd.getNumPartitions()

8

--------------------------
**Test 7** : Store many partitions and read them back

In [55]:
df = spark.range(1000000)
df = df.withColumn("data", random_string_udf())
df = df.repartition(32, F.col("data"))
df = df.select("data")

In [56]:
%time
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/ -alh

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 9.06 µs
total 197M
drwxrwxr-x  2 mageswarand mageswarand  12K May  5 22:12 .
drwxrwxrwt 37 root        root        444K May  5 22:11 ..
-rw-r--r--  1 mageswarand mageswarand    8 May  5 22:12 ._SUCCESS.crc
-rw-r--r--  1 mageswarand mageswarand  49K May  5 22:12 .part-00000-c59aa035-21bd-425f-9538-54bf009852b1-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand  49K May  5 22:12 .part-00001-c59aa035-21bd-425f-9538-54bf009852b1-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand  49K May  5 22:12 .part-00002-c59aa035-21bd-425f-9538-54bf009852b1-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand  49K May  5 22:12 .part-00003-c59aa035-21bd-425f-9538-54bf009852b1-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand  50K May  5 22:12 .part-00004-c59aa035-21bd-425f-9538-54bf009852b1-c000.snappy.parquet.crc
-rw-r--r--  1 mageswarand mageswarand  49K May  5 22:12 .part-00005-c59aa035-21bd-425f-95

In [57]:
df = spark.read.parquet("/tmp/df_tes/")
df.rdd.getNumPartitions()

8

In [58]:
df.groupBy(spark_partition_id()).count().show()

+--------------------+------+
|SPARK_PARTITION_ID()| count|
+--------------------+------+
|                   1|125740|
|                   6|124307|
|                   3|125104|
|                   5|124643|
|                   4|124935|
|                   7|123645|
|                   2|125380|
|                   0|126246|
+--------------------+------+
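
Both read-backs above returned 8 partitions even though the data was written as 1 file and then as 32 files. The sketch below is a simplified model of how Spark appears to size and pack file splits when reading, not Spark's actual code. It assumes the default `spark.sql.files.maxPartitionBytes` of 128 MB, the default `spark.sql.files.openCostInBytes` of 4 MB, a default parallelism of 8 (inferred from the 8 partitions seen earlier), and a total size of 195 MiB taken from the directory listing. The function name `estimate_read_partitions` is hypothetical.

```python
MB = 1024 * 1024

def estimate_read_partitions(file_sizes, parallelism,
                             max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Simplified model of file-split sizing and packing; not Spark's actual code."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    split_limit = min(max_partition_bytes, max(open_cost, bytes_per_core))

    # Cut every file into chunks no larger than split_limit.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunk = min(split_limit, size - offset)
            chunks.append(chunk)
            offset += chunk

    # Pack chunks (largest first) into partitions, charging open_cost per chunk.
    partitions, current = 0, 0
    for chunk in sorted(chunks, reverse=True):
        if current > 0 and current + chunk > split_limit:
            partitions += 1
            current = 0
        current += chunk + open_cost
    return partitions + (1 if current > 0 else 0)

one_file = estimate_read_partitions([195 * MB], parallelism=8)
many_files = estimate_read_partitions([195 * MB // 32] * 32, parallelism=8)
print(one_file, many_files)  # prints: 8 8
```

Under these assumptions the model gives 8 partitions in both cases: the single large file is cut into 8 chunks, while the 32 small files are packed four to a partition. That would also explain why each partition holds roughly 125,000 rows after the second read.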



--------------------------
**Test 8** : Fewer records than partitions?

Spark will try to distribute the data evenly across the partitions. If the number of partitions is greater than the record count (or RDD size), some partitions will be empty.
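
A minimal pure-Python sketch of why most partitions stay empty in this situation. It uses CRC32 as a deterministic stand-in for Spark's own hash function, and the keys `row-0` to `row-9` are hypothetical; only the counting argument carries over to Spark.

```python
import zlib
from collections import Counter

def stable_hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash (CRC32 here, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"row-{i}" for i in range(10)]   # 10 hypothetical records
num_partitions = 100

# Count how many records land in each partition; partitions with no records never appear.
rows_per_partition = Counter(stable_hash_partition(k, num_partitions) for k in keys)
non_empty = len(rows_per_partition)
print(non_empty, "non-empty partitions,", num_partitions - non_empty, "empty partitions")
```

With 10 records, at most 10 of the 100 partitions can receive a row, so at least 90 remain empty regardless of the hash function used.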

In [59]:
df = spark.range(10)
df = df.withColumn("data", random_string_udf())
df = df.repartition(100, F.col("data"))
df = df.select("data")

In [60]:
%time
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.63 µs
_SUCCESS
part-00000-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00016-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00017-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00021-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00028-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00044-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00054-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00089-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00094-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet
part-00099-86982ed7-5db5-4630-accd-f5e9bbfeb8b5-c000.snappy.parquet


In [61]:
res = df.groupBy(spark_partition_id()).agg(F.count("data").alias("id")).orderBy("id")

In [62]:
res.show(1000)

+--------------------+---+
|SPARK_PARTITION_ID()| id|
+--------------------+---+
|                  92|  1|
|                  34|  1|
|                  11|  1|
|                   3|  1|
|                  82|  1|
|                  45|  1|
|                  86|  1|
|                  16|  1|
|                  57|  1|
|                  60|  1|
+--------------------+---+



In [63]:
res.count()

10

---------------------------
**Test 9** : Default partition count when repartitioning by column? It equals 200 (the `ls | wc -l` count of 201 includes the `_SUCCESS` file)

In [64]:
df = spark.range(10000)
df = df.withColumn("data", random_string_udf())
df = df.repartition(F.col("data"))

In [65]:
%time
! rm -rf /tmp/df_tes/
df.write.parquet("/tmp/df_tes/")
!ls /tmp/df_tes/ | wc -l

CPU times: user 1e+03 ns, sys: 1e+03 ns, total: 2 µs
Wall time: 6.44 µs
201


In [66]:
df.groupBy(spark_partition_id()).count().count()

200

---------------
**Test 10** : Multi-column partitioning and writing by partition keys

In [67]:
from datetime import date, timedelta

In [68]:
start_date = date(2019, 1, 1)
data = []
for i in range(0, 50):
    data.append({"Country": "CN", "Date": start_date + timedelta(days=i), "Amount": 10+i})
    data.append({"Country": "AU", "Date": start_date + timedelta(days=i), "Amount": 10+i})

schema = StructType([StructField('Country', StringType(), nullable=False),
                     StructField('Date', DateType(), nullable=False),
                     StructField('Amount', IntegerType(), nullable=False)])

df = spark.createDataFrame(data, schema=schema)
df.show()
print(df.rdd.getNumPartitions())

+-------+----------+------+
|Country|      Date|Amount|
+-------+----------+------+
|     CN|2019-01-01|    10|
|     AU|2019-01-01|    10|
|     CN|2019-01-02|    11|
|     AU|2019-01-02|    11|
|     CN|2019-01-03|    12|
|     AU|2019-01-03|    12|
|     CN|2019-01-04|    13|
|     AU|2019-01-04|    13|
|     CN|2019-01-05|    14|
|     AU|2019-01-05|    14|
|     CN|2019-01-06|    15|
|     AU|2019-01-06|    15|
|     CN|2019-01-07|    16|
|     AU|2019-01-07|    16|
|     CN|2019-01-08|    17|
|     AU|2019-01-08|    17|
|     CN|2019-01-09|    18|
|     AU|2019-01-09|    18|
|     CN|2019-01-10|    19|
|     AU|2019-01-10|    19|
+-------+----------+------+
only showing top 20 rows

8


In [69]:
df = df.withColumn("Year", F.year("Date")).withColumn("Month", F.month("Date")).withColumn("Day", F.dayofmonth("Date"))
df = df.repartition("Year", "Month", "Day", "Country")
print(df.rdd.getNumPartitions())
df.show()

200
+-------+----------+------+----+-----+---+
|Country|      Date|Amount|Year|Month|Day|
+-------+----------+------+----+-----+---+
|     AU|2019-01-21|    30|2019|    1| 21|
|     CN|2019-01-29|    38|2019|    1| 29|
|     AU|2019-01-19|    28|2019|    1| 19|
|     AU|2019-02-07|    47|2019|    2|  7|
|     AU|2019-02-02|    42|2019|    2|  2|
|     AU|2019-02-05|    45|2019|    2|  5|
|     AU|2019-02-08|    48|2019|    2|  8|
|     CN|2019-01-27|    36|2019|    1| 27|
|     CN|2019-01-21|    30|2019|    1| 21|
|     AU|2019-01-11|    20|2019|    1| 11|
|     CN|2019-01-25|    34|2019|    1| 25|
|     CN|2019-02-06|    46|2019|    2|  6|
|     CN|2019-01-19|    28|2019|    1| 19|
|     CN|2019-02-19|    59|2019|    2| 19|
|     AU|2019-02-03|    43|2019|    2|  3|
|     AU|2019-02-09|    49|2019|    2|  9|
|     CN|2019-01-14|    23|2019|    1| 14|
|     AU|2019-01-16|    25|2019|    1| 16|
|     CN|2019-02-16|    56|2019|    2| 16|
|     AU|2019-01-10|    19|2019|    1| 10|
+------

In [70]:
df.write.partitionBy("Year", "Month", "Day", "Country").mode("overwrite").csv("/tmp/df_tes/", header=True)

In [71]:
!tree /tmp/df_tes/

/tmp/df_tes/
├── Year=2019
│   ├── Month=1
│   │   ├── Day=1
│   │   │   ├── Country=AU
│   │   │   │   └── part-00151-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   │   └── Country=CN
│   │   │       └── part-00172-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   ├── Day=10
│   │   │   ├── Country=AU
│   │   │   │   └── part-00037-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   │   └── Country=CN
│   │   │       └── part-00112-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   ├── Day=11
│   │   │   ├── Country=AU
│   │   │   │   └── part-00026-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   │   └── Country=CN
│   │   │       └── part-00111-551dc6ed-c92d-4d61-914b-b874ce71b437.c000.csv
│   │   ├── Day=12
│   │   │   ├── Country=AU
│   │   │   │   └── part-00060-551dc6ed-c92d-4d61

**Read from partitioned data**

Now let’s read the data from the partitioned files with these criteria:

    Year=2019
    Month=2
    Day=1
    Country=CN

In [72]:
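# header=True is not passed here, so the CSV header line is read as a data row (see the output below)
df = spark.read.csv("/tmp/df_tes/Year=2019/Month=2/Day=1/Country=CN")
print(df.rdd.getNumPartitions()) # only one because there is only one record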
df.show()

1
+----------+------+
|       _c0|   _c1|
+----------+------+
|      Date|Amount|
|2019-02-01|    41|
+----------+------+



Similarly, we can also query all the data for the second month:

In [73]:
df = spark.read.csv("/tmp/df_tes/Year=2019/Month=2")
print(df.rdd.getNumPartitions())
df.show()

8
+----------+------+---+-------+
|       _c0|   _c1|Day|Country|
+----------+------+---+-------+
|      Date|Amount|  3|     CN|
|2019-02-03|    43|  3|     CN|
|      Date|Amount| 10|     CN|
|2019-02-10|    50| 10|     CN|
|      Date|Amount| 13|     CN|
|2019-02-13|    53| 13|     CN|
|      Date|Amount| 16|     AU|
|2019-02-16|    56| 16|     AU|
|      Date|Amount| 15|     CN|
|2019-02-15|    55| 15|     CN|
|      Date|Amount| 16|     CN|
|2019-02-16|    56| 16|     CN|
|      Date|Amount| 17|     CN|
|2019-02-17|    57| 17|     CN|
|      Date|Amount| 10|     AU|
|2019-02-10|    50| 10|     AU|
|      Date|Amount|  5|     AU|
|2019-02-05|    45|  5|     AU|
|      Date|Amount| 15|     AU|
|2019-02-15|    55| 15|     AU|
+----------+------+---+-------+
only showing top 20 rows



**Use wildcards for partition discovery**

In [74]:
df = spark.read.option("basePath", "/tmp/df_tes/").csv("/tmp/df_tes/Year=*/Month=*/Day=*/Country=CN")
print(df.rdd.getNumPartitions())
df.show()

8
+----------+------+----+-----+---+-------+
|       _c0|   _c1|Year|Month|Day|Country|
+----------+------+----+-----+---+-------+
|      Date|Amount|2019|    2|  3|     CN|
|2019-02-03|    43|2019|    2|  3|     CN|
|      Date|Amount|2019|    1| 17|     CN|
|2019-01-17|    26|2019|    1| 17|     CN|
|      Date|Amount|2019|    2| 10|     CN|
|2019-02-10|    50|2019|    2| 10|     CN|
|      Date|Amount|2019|    1|  3|     CN|
|2019-01-03|    12|2019|    1|  3|     CN|
|      Date|Amount|2019|    1| 24|     CN|
|2019-01-24|    33|2019|    1| 24|     CN|
|      Date|Amount|2019|    2| 13|     CN|
|2019-02-13|    53|2019|    2| 13|     CN|
|      Date|Amount|2019|    1| 25|     CN|
|2019-01-25|    34|2019|    1| 25|     CN|
|      Date|Amount|2019|    1|  1|     CN|
|2019-01-01|    10|2019|    1|  1|     CN|
|      Date|Amount|2019|    1| 21|     CN|
|2019-01-21|    30|2019|    1| 21|     CN|
|      Date|Amount|2019|    2| 15|     CN|
|2019-02-15|    55|2019|    2| 15|     CN|
+--------

We can use wildcards in any part of the path for partition discovery. For example, the following code looks up the data for month 2 of country AU:

In [75]:
df = spark.read.option("basePath", "/tmp/df_tes/").csv("/tmp/df_tes/Year=*/Month=2/Day=*/Country=AU")
print(df.rdd.getNumPartitions())
df.show()

7
+----------+------+----+-----+---+-------+
|       _c0|   _c1|Year|Month|Day|Country|
+----------+------+----+-----+---+-------+
|      Date|Amount|2019|    2| 16|     AU|
|2019-02-16|    56|2019|    2| 16|     AU|
|      Date|Amount|2019|    2| 10|     AU|
|2019-02-10|    50|2019|    2| 10|     AU|
|      Date|Amount|2019|    2|  5|     AU|
|2019-02-05|    45|2019|    2|  5|     AU|
|      Date|Amount|2019|    2| 15|     AU|
|2019-02-15|    55|2019|    2| 15|     AU|
|      Date|Amount|2019|    2| 12|     AU|
|2019-02-12|    52|2019|    2| 12|     AU|
|      Date|Amount|2019|    2|  1|     AU|
|2019-02-01|    41|2019|    2|  1|     AU|
|      Date|Amount|2019|    2|  8|     AU|
|2019-02-08|    48|2019|    2|  8|     AU|
|      Date|Amount|2019|    2|  6|     AU|
|2019-02-06|    46|2019|    2|  6|     AU|
|      Date|Amount|2019|    2| 14|     AU|
|2019-02-14|    54|2019|    2| 14|     AU|
|      Date|Amount|2019|    2| 13|     AU|
|2019-02-13|    53|2019|    2| 13|     AU|
+--------
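
The outputs above differ in which partition columns appear: reading `/tmp/df_tes/Year=2019/Month=2` directly yields only `Day` and `Country`, while setting `basePath` to `/tmp/df_tes/` also yields `Year` and `Month`. A minimal pure-Python sketch of that idea follows. It is not Spark's implementation; the helper `partition_values` and the file name `part-00000.csv` are hypothetical.

```python
from pathlib import PurePosixPath

def partition_values(file_path: str, base_path: str) -> dict:
    """Collect key=value directory names found between base_path and the file."""
    relative = PurePosixPath(file_path).relative_to(base_path)
    values = {}
    for part in relative.parts[:-1]:          # skip the file name itself
        if "=" in part:
            key, value = part.split("=", 1)
            values[key] = value
    return values

# Hypothetical file name; the directory layout matches the tree shown earlier.
path = "/tmp/df_tes/Year=2019/Month=2/Day=1/Country=CN/part-00000.csv"

from_root = partition_values(path, "/tmp/df_tes/")
from_month = partition_values(path, "/tmp/df_tes/Year=2019/Month=2")
print(from_root)
print(from_month)
```

Only the directories below the starting point contribute columns, which is why the wildcard reads pass `basePath` to keep `Year` and `Month` in the result.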

## Data Partitioning Functions  

In [76]:
from pyspark.rdd import portable_hash
from pyspark import Row

In [77]:
# Populate sample data
countries = ("CN", "AU", "US")
data = []
for i in range(1, 13):
    data.append({"ID": i, "Country": countries[i % 3],  "Amount": 10+i})

df = spark.createDataFrame(data)
df.show()



+------+-------+---+
|Amount|Country| ID|
+------+-------+---+
|    11|     AU|  1|
|    12|     US|  2|
|    13|     CN|  3|
|    14|     AU|  4|
|    15|     US|  5|
|    16|     CN|  6|
|    17|     AU|  7|
|    18|     US|  8|
|    19|     CN|  9|
|    20|     AU| 10|
|    21|     US| 11|
|    22|     CN| 12|
+------+-------+---+



In [78]:
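def print_partitions(df):
    """Print the partition count, partitioner, physical plan and the rows of each partition."""
    num_partitions = df.rdd.getNumPartitions()
    print("Total partitions: {}\n".format(num_partitions))
    print("Partitioner: {}\n".format(df.rdd.partitioner))
    df.explain()
    print("\n")
    # glom() turns each partition into a list; collect() brings all of them to the driver
    parts = df.rdd.glom().collect()
    row_number = 0  # running row index across all partitions
    for partition_index, rows in enumerate(parts):
        print("\nPartition {}:".format(partition_index))
        for row in rows:
            print("Row {}:{}".format(row_number, row))
            row_number += 1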

In [79]:
print_partitions(df)

Total partitions: 8

Partitioner: None

== Physical Plan ==
Scan ExistingRDD[Amount#796L,Country#797,ID#798L]



Partition 0:
Row 0:Row(Amount=11, Country='AU', ID=1)

Partition 1:
Row 1:Row(Amount=12, Country='US', ID=2)
Row 2:Row(Amount=13, Country='CN', ID=3)

Partition 2:
Row 3:Row(Amount=14, Country='AU', ID=4)

Partition 3:
Row 4:Row(Amount=15, Country='US', ID=5)
Row 5:Row(Amount=16, Country='CN', ID=6)

Partition 4:
Row 6:Row(Amount=17, Country='AU', ID=7)

Partition 5:
Row 7:Row(Amount=18, Country='US', ID=8)
Row 8:Row(Amount=19, Country='CN', ID=9)

Partition 6:
Row 9:Row(Amount=20, Country='AU', ID=10)

Partition 7:
Row 10:Row(Amount=21, Country='US', ID=11)
Row 11:Row(Amount=22, Country='CN', ID=12)


In [80]:
df = df.repartition(3, "Country")

In [81]:
print_partitions(df)

Total partitions: 3

Partitioner: None

== Physical Plan ==
Exchange hashpartitioning(Country#797, 3)
+- Scan ExistingRDD[Amount#796L,Country#797,ID#798L]



Partition 0:

Partition 1:
Row 0:Row(Amount=15, Country='US', ID=5)
Row 1:Row(Amount=16, Country='CN', ID=6)
Row 2:Row(Amount=21, Country='US', ID=11)
Row 3:Row(Amount=22, Country='CN', ID=12)
Row 4:Row(Amount=12, Country='US', ID=2)
Row 5:Row(Amount=13, Country='CN', ID=3)
Row 6:Row(Amount=18, Country='US', ID=8)
Row 7:Row(Amount=19, Country='CN', ID=9)

Partition 2:
Row 8:Row(Amount=14, Country='AU', ID=4)
Row 9:Row(Amount=20, Country='AU', ID=10)
Row 10:Row(Amount=11, Country='AU', ID=1)
Row 11:Row(Amount=17, Country='AU', ID=7)


**You might expect each of the three partitions to hold the data of exactly one country, but that is not the case. Why? Because the repartition function uses hash partitioning by default, so different country codes may be allocated to the same partition number.**
We can illustrate this by using the following code to calculate a hash for each row:

In [82]:
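# F.udf defaults to a StringType return value, so Hash# is a string column
udf_portable_hash = F.udf(lambda country: portable_hash(country))
df = df.withColumn("Hash#", udf_portable_hash(df.Country))
# Spark casts the string to double for the modulo, which is why Partition# prints as -1.0 / 0.0
df = df.withColumn("Partition#", df["Hash#"] % 3)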
df.show()

+------+-------+---+--------------------+----------+
|Amount|Country| ID|               Hash#|Partition#|
+------+-------+---+--------------------+----------+
|    12|     US|  2|-8328537658613580243|      -1.0|
|    13|     CN|  3|-7458853143580063552|      -1.0|
|    18|     US|  8|-8328537658613580243|      -1.0|
|    19|     CN|  9|-7458853143580063552|      -1.0|
|    15|     US|  5|-8328537658613580243|      -1.0|
|    16|     CN|  6|-7458853143580063552|      -1.0|
|    21|     US| 11|-8328537658613580243|      -1.0|
|    22|     CN| 12|-7458853143580063552|      -1.0|
|    14|     AU|  4| 6593628092971972691|       0.0|
|    20|     AU| 10| 6593628092971972691|       0.0|
|    11|     AU|  1| 6593628092971972691|       0.0|
|    17|     AU|  7| 6593628092971972691|       0.0|
+------+-------+---+--------------------+----------+
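
The `Partition#` values in this table should be read with care. The hash is returned as a string and converted to a double before the modulo, so precision is lost, and a result of `-1.0` cannot be a real partition number. The sketch below reproduces this in plain Python, using two hash values copied from the table. `math.fmod` is used because its result takes the sign of the dividend, which matches the behaviour visible in the table; this is an illustration, not Spark's partitioning code.

```python
import math

# Hash values copied from the table above.
hashes = {
    "US": -8328537658613580243,
    "AU": 6593628092971972691,
}

results = {}
for country, h in hashes.items():
    exact = h % 3                           # exact integer arithmetic, always in 0..2
    via_double = math.fmod(float(h), 3.0)   # double arithmetic, sign follows the dividend
    results[country] = (exact, via_double)
    print(country, exact, via_double)
# US 0 -1.0
# AU 2 0.0
```

The double-based results match the table (`-1.0` for US, `0.0` for AU) but differ from the exact integer results, so this column shows only that rows with the same country share a value, not which partition they actually land in. The physical plan reports `hashpartitioning`, and as far as I know that uses Spark's own hash with a non-negative modulo rather than Python's `portable_hash`.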



To give each country its own partition number, we can define a custom partitioning function that maps every country code to a distinct index:

In [83]:
countries = ("CN", "AU", "US")
def country_partitioning(k):
    return countries.index(k)
    
udf_country_hash = F.udf(lambda country: country_partitioning(country))

In [84]:
numPartitions = 3
# df = df.partitionBy(numPartitions, country_partitioning)
df = df.withColumn("Hash#", udf_country_hash(df['Country']))
df = df.withColumn("Partition#", df["Hash#"] % numPartitions)
df.orderBy('Country').show()

+------+-------+---+-----+----------+
|Amount|Country| ID|Hash#|Partition#|
+------+-------+---+-----+----------+
|    17|     AU|  7|    1|       1.0|
|    14|     AU|  4|    1|       1.0|
|    20|     AU| 10|    1|       1.0|
|    11|     AU|  1|    1|       1.0|
|    22|     CN| 12|    0|       0.0|
|    16|     CN|  6|    0|       0.0|
|    19|     CN|  9|    0|       0.0|
|    13|     CN|  3|    0|       0.0|
|    15|     US|  5|    2|       2.0|
|    12|     US|  2|    2|       2.0|
|    18|     US|  8|    2|       2.0|
|    21|     US| 11|    2|       2.0|
+------+-------+---+-----+----------+



In [85]:
print_partitions(df)

Total partitions: 3

Partitioner: None

== Physical Plan ==
*(1) Project [Amount#796L, Country#797, ID#798L, pythonUDF1#874 AS Hash##843, (cast(pythonUDF1#874 as double) % 3.0) AS Partition##849]
+- BatchEvalPython [<lambda>(Country#797), <lambda>(Country#797)], [Amount#796L, Country#797, ID#798L, pythonUDF0#873, pythonUDF1#874]
   +- Exchange hashpartitioning(Country#797, 3)
      +- Scan ExistingRDD[Amount#796L,Country#797,ID#798L]



Partition 0:

Partition 1:
Row 0:Row(Amount=15, Country='US', ID=5, Hash#='2', Partition#=2.0)
Row 1:Row(Amount=16, Country='CN', ID=6, Hash#='0', Partition#=0.0)
Row 2:Row(Amount=21, Country='US', ID=11, Hash#='2', Partition#=2.0)
Row 3:Row(Amount=22, Country='CN', ID=12, Hash#='0', Partition#=0.0)
Row 4:Row(Amount=12, Country='US', ID=2, Hash#='2', Partition#=2.0)
Row 5:Row(Amount=13, Country='CN', ID=3, Hash#='0', Partition#=0.0)
Row 6:Row(Amount=18, Country='US', ID=8, Hash#='2', Partition#=2.0)
Row 7:Row(Amount=19, Country='CN', ID=9, Hash#='0', Pa

In [86]:
print_partitions(df.repartition(3, "Partition#"))

Total partitions: 3

Partitioner: None

== Physical Plan ==
Exchange hashpartitioning(Partition##849, 3)
+- *(1) Project [Amount#796L, Country#797, ID#798L, pythonUDF1#876 AS Hash##843, (cast(pythonUDF1#876 as double) % 3.0) AS Partition##849]
   +- BatchEvalPython [<lambda>(Country#797), <lambda>(Country#797)], [Amount#796L, Country#797, ID#798L, pythonUDF0#875, pythonUDF1#876]
      +- Exchange hashpartitioning(Country#797, 3)
         +- Scan ExistingRDD[Amount#796L,Country#797,ID#798L]



Partition 0:
Row 0:Row(Amount=12, Country='US', ID=2, Hash#='2', Partition#=2.0)
Row 1:Row(Amount=18, Country='US', ID=8, Hash#='2', Partition#=2.0)
Row 2:Row(Amount=15, Country='US', ID=5, Hash#='2', Partition#=2.0)
Row 3:Row(Amount=21, Country='US', ID=11, Hash#='2', Partition#=2.0)

Partition 1:
Row 4:Row(Amount=13, Country='CN', ID=3, Hash#='0', Partition#=0.0)
Row 5:Row(Amount=19, Country='CN', ID=9, Hash#='0', Partition#=0.0)
Row 6:Row(Amount=16, Country='CN', ID=6, Hash#='0', Partition#=0.0

# Range Partitions

In [87]:
from pyspark.sql.types import IntegerType
df_test_1 = [i for i in range(10000)]
df_test_1 = spark.createDataFrame(df_test_1, schema=IntegerType())
df_test_1.show()

+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
|   13|
|   14|
|   15|
|   16|
|   17|
|   18|
|   19|
+-----+
only showing top 20 rows



In [88]:
df_test_2 = [0 for i in range(10000)] + [500, 1000, 10000]
df_test_2 = spark.createDataFrame(df_test_2, schema=IntegerType())
df_test_2.show()

+-----+
|value|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
+-----+
only showing top 20 rows



`repartition` applies the `HashPartitioner` when one or more columns are provided, and the `RoundRobinPartitioner` when no column is provided; the latter distributes the data evenly across the requested number of partitions. If one or more columns are provided, their values are hashed and used to determine the partition number by calculating something like `partition = hash(columns) % numberOfPartitions`.
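
The two strategies can be sketched in plain Python. This is a simplified illustration: CRC32 stands in for Spark's hash function, the round-robin counter starts at row 0 (Spark's starting position may differ), and the helper names are hypothetical. The data mirrors `df_test_2`: 10,000 zeros plus the values 500, 1000 and 10000.

```python
import zlib
from collections import Counter

def hash_partition(value, num_partitions):
    # CRC32 is an illustrative stand-in; Spark uses its own hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions

def round_robin_partition(row_index, num_partitions):
    # Rows are dealt out in turn, ignoring their content.
    return row_index % num_partitions

values = [0] * 10000 + [500, 1000, 10000]

by_hash = Counter(hash_partition(v, 4) for v in values)
by_round_robin = Counter(round_robin_partition(i, 4) for i in range(len(values)))

print("hash:       ", sorted(by_hash.values(), reverse=True))
print("round robin:", sorted(by_round_robin.values(), reverse=True))
```

Equal values always hash to the same partition, so one partition receives at least the 10,000 zeros, whereas round-robin assignment yields partitions of 2,500 or 2,501 rows.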

`repartitionByRange` partitions the data based on ranges of the column values. This is usually used for continuous (not discrete) values such as numbers. Note that, for performance reasons, this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
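
A minimal sketch of range partitioning in plain Python. To stay deterministic it derives the boundaries from the full sorted data instead of a sample, and it drops duplicate boundaries; both choices are simplifying assumptions, and the helper names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def range_bounds(values, num_partitions):
    """Pick up to num_partitions - 1 increasing boundaries from the sorted values."""
    ordered = sorted(values)
    bounds = []
    for i in range(1, num_partitions):
        candidate = ordered[i * len(ordered) // num_partitions]
        if not bounds or candidate > bounds[-1]:   # skip duplicate boundaries
            bounds.append(candidate)
    return bounds

def range_partition(value, bounds):
    """Values less than or equal to a boundary go to the partition below it."""
    return bisect_left(bounds, value)

uniform = list(range(10000))
skewed = [0] * 10000 + [500, 1000, 10000]

uniform_counts = Counter(range_partition(v, range_bounds(uniform, 4)) for v in uniform)
skewed_counts = Counter(range_partition(v, range_bounds(skewed, 4)) for v in skewed)

print(sorted(uniform_counts.items()))  # [(0, 2501), (1, 2500), (2, 2500), (3, 2499)]
print(sorted(skewed_counts.items()))   # [(0, 10000), (1, 3)]
```

With evenly spread values the four ranges are nearly equal in size. With heavily repeated values every candidate boundary is 0, so only one boundary survives and the data collapses into two partitions.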

It is also worth mentioning that, for both methods, if no `numPartitions` is given, the DataFrame is partitioned into `spark.sql.shuffle.partitions` partitions as configured in your Spark session, and these may be coalesced by Adaptive Query Execution (available since Spark 3.x).

In [89]:
import pyspark.sql.functions as F

# Use the built-in SQL function spark_partition_id() to determine each row's actual partition

def get_partition_info(df):
    test_res_df = df\
        .withColumn("partition", F.spark_partition_id()) \
        .groupBy(F.col("partition"))\
        .agg(F.count(F.col("value")).alias("count"),\
          F.min(F.col("value")).alias("min_value"),\
          F.max(F.col("value")).alias("max_value"))\
        .orderBy(F.col("partition"))

    test_res_df.show()
    
get_partition_info(df_test_1)
get_partition_info(df_test_2)

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0| 1024|        0|     1023|
|        1| 1024|     1024|     2047|
|        2| 1024|     2048|     3071|
|        3| 2048|     3072|     5119|
|        4| 1024|     5120|     6143|
|        5| 1024|     6144|     7167|
|        6| 1024|     7168|     8191|
|        7| 1808|     8192|     9999|
+---------+-----+---------+---------+

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0| 1024|        0|        0|
|        1| 1024|        0|        0|
|        2| 1024|        0|        0|
|        3| 2048|        0|        0|
|        4| 1024|        0|        0|
|        5| 1024|        0|        0|
|        6| 1024|        0|        0|
|        7| 1811|        0|    10000|
+---------+-----+---------+---------+



As expected, we get 4 partitions, and because the values of `df_test_1` range from 0 to 9999, their hashed values result in a well-distributed DataFrame.

In [90]:
get_partition_info(df_test_1.repartition(4, F.col("value")))

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0| 2490|       12|     9991|
|        1| 2518|        6|     9999|
|        2| 2507|        2|     9997|
|        3| 2485|        0|     9992|
+---------+-----+---------+---------+



In this case we also get 4 partitions, but this time the min and max values clearly show the range of values within each partition. The data is almost equally distributed, with about 2,500 values per partition.

In [91]:
get_partition_info(df_test_1.repartitionByRange(4, F.col("value")))

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0| 2515|        0|     2514|
|        1| 2462|     2515|     4976|
|        2| 2510|     4977|     7486|
|        3| 2513|     7487|     9999|
+---------+-----+---------+---------+



Now we use the other DataFrame, `df_test_2`. Here, the hashing algorithm hashes values that are only 0, 500, 1000 or 10000. Of course, the hash of the value 0 is always the same, so all zeros end up in the same partition (in this case partition 3). The other two non-empty partitions contain only one or two values.

In [92]:
get_partition_info(df_test_2.repartition(4, F.col("value")))

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        1|    2|     1000|    10000|
|        2|    1|      500|      500|
|        3|10000|        0|        0|
+---------+-----+---------+---------+



Without the content of the column "value", the `repartition` method distributes the rows on a round-robin basis. All partitions hold almost the same amount of data.

In [93]:
get_partition_info(df_test_2.repartition(4))

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0| 2501|        0|      500|
|        1| 2501|        0|    10000|
|        2| 2500|        0|        0|
|        3| 2501|        0|     1000|
+---------+-----+---------+---------+



This case shows that the DataFrame `df_test_2` is not well suited to repartitioning by range, as almost all values are 0. Therefore, we end up with only two partitions, where partition 0 contains all the zeros.

In [94]:
get_partition_info(df_test_2.repartitionByRange(4, F.col("value")))

+---------+-----+---------+---------+
|partition|count|min_value|max_value|
+---------+-----+---------+---------+
|        0|10000|        0|        0|
|        1|    3|      500|    10000|
+---------+-----+---------+---------+

