# PySpark - repartition() and coalesce()
repartition() is used to increase or decrease the number of RDD/DataFrame partitions, whereas coalesce() is used only to decrease the number of partitions, and does so more efficiently.

Note: PySpark repartition() always performs a full shuffle, and coalesce() still moves data between partitions when it merges them, so both can be expensive on large datasets. Use them only when you need to.

With an RDD, you can set the number of partitions at creation time using parallelize(), textFile() and wholeTextFiles().

In [2]:
import findspark
findspark.init()
# Create SparkSession from builder
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

In [4]:
# The session above runs with local[1], so the default parallelism is 1
rdd = spark.sparkContext.parallelize(range(0,20))
print("From local[1] : "+str(rdd.getNumPartitions()))

# Use parallelize with 6 partitions
rdd1 = spark.sparkContext.parallelize(range(0,25), 6)
print("parallelize : "+str(rdd1.getNumPartitions()))

# Read a text file, asking for 10 partitions
rddFromFile = spark.sparkContext.textFile("../resources/tmp/test.txt",10)
print("TextFile : "+str(rddFromFile.getNumPartitions()))

From local[1] : 1
parallelize : 6
TextFile : 10
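
To make the idea of splitting a collection into partitions more concrete, here is a minimal plain-Python sketch of an even, contiguous split of range(0,25) into 6 slices. The helper name `slice_bounds` is hypothetical, and Spark's own slicing logic may differ in detail; this only illustrates the concept.

```python
def slice_bounds(size, num_slices):
    """Return (start, end) index pairs that cut `size` items into contiguous slices."""
    # Hypothetical helper: integer arithmetic keeps the boundaries exact.
    return [
        (i * size // num_slices, (i + 1) * size // num_slices)
        for i in range(num_slices)
    ]

data = list(range(0, 25))
chunks = [data[start:end] for start, end in slice_bounds(len(data), 6)]

# Print each slice with its index to see how the 25 items are spread out
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Under this model, five slices hold 4 items each and the last one holds 5, so the data is spread almost evenly.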


## RDD repartition()
The repartition() method is used to increase or decrease the number of partitions. The example below decreases the partitions of rdd1 from 6 to 4 by moving data from all partitions.

In [6]:
# Using repartition
rdd2 = rdd1.repartition(4)
print("Repartition size : "+str(rdd2.getNumPartitions()))
rdd2.saveAsTextFile("../resources/tmp/re-partition")

Repartition size : 4


## coalesce()
RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition() for this case: instead of redistributing every record, it merges existing partitions, so less data moves between partitions.

In [7]:
# Using coalesce()
rdd3 = rdd1.coalesce(4)
print("Coalesce size : "+str(rdd3.getNumPartitions()))
rdd3.saveAsTextFile("/tmp/coalesce")

Coalesce size : 4
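
The difference in data movement can be illustrated with a toy model in plain Python. This is a sketch, not Spark's implementation: the function names `coalesce_model` and `repartition_model` are hypothetical, merging neighbouring partitions is an assumed grouping strategy, and round-robin assignment stands in for a full shuffle.

```python
def coalesce_model(partitions, n):
    """Merge whole partitions into at most n groups; count items that change partition."""
    n = min(n, len(partitions))          # coalesce cannot increase the count
    new_parts = [[] for _ in range(n)]
    anchored = set()                     # targets that already kept one partition in place
    moved = 0
    for i, part in enumerate(partitions):
        target = i * n // len(partitions)
        if target in anchored:
            moved += len(part)           # merged into another partition, so it moves
        else:
            anchored.add(target)         # first partition of the group stays put
        new_parts[target].extend(part)
    return new_parts, moved

def repartition_model(partitions, n):
    """Reassign every item round-robin, as a stand-in for a full shuffle."""
    new_parts = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for item in part:
            new_parts[position % n].append(item)
            position += 1
    return new_parts, position           # every item is counted as moved

# Six partitions holding the numbers 0..24, reduced to four
data = list(range(25))
bounds = [0, 4, 8, 12, 16, 20, 25]
partitions = [data[bounds[i]:bounds[i + 1]] for i in range(6)]

coalesced, coalesce_moved = coalesce_model(partitions, 4)
repartitioned, repartition_moved = repartition_model(partitions, 4)

print("coalesce   :", [len(p) for p in coalesced], "moved", coalesce_moved)
print("repartition:", [len(p) for p in repartitioned], "moved", repartition_moved)
```

In this model coalesce moves 8 of the 25 items but leaves uneven partitions, while repartition moves all 25 and ends up with balanced ones. That trade-off between less movement and better balance is the main thing to weigh when choosing between the two.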


## PySpark DataFrame repartition() vs coalesce()
Unlike with an RDD, you generally don't specify the partition count or parallelism when creating a DataFrame. By default, a DataFrame relies on the same mechanisms described in the RDD section above to determine the number of partitions and split the data for parallelism.

In [8]:
# DataFrame example
df=spark.range(0,20)
print(df.rdd.getNumPartitions())

df.write.mode("overwrite").csv("../resources/tmp/partition.csv")

1


### DataFrame repartition()
Similar to RDD, the PySpark DataFrame repartition() method is used to increase or decrease the number of partitions. The example below increases the partitions from 1 to 6 by moving data from all partitions.

In [9]:
# DataFrame repartition
df2 = df.repartition(6)
print(df2.rdd.getNumPartitions())

6


Decreasing the number of partitions with repartition() also moves data from all partitions. Hence, when you want to decrease the number of partitions, the recommendation is to use coalesce().

### DataFrame coalesce()
PySpark DataFrame coalesce() is used only to decrease the number of partitions. It is an optimized version of repartition() for this case, because merging existing partitions moves less data across partitions.

In [10]:
# DataFrame coalesce
# df has only 1 partition, and coalesce() cannot increase the count, so this prints 1
df3 = df.coalesce(2)
print(df3.rdd.getNumPartitions())

1


## Default Shuffle Partition
Calling groupBy(), join() and similar wide transformations on a DataFrame shuffles data between executors, and possibly between machines, and repartitions the result into 200 partitions by default. This default comes from the spark.sql.shuffle.partitions configuration. Note that when adaptive query execution is enabled, Spark may merge small shuffle partitions, which is a likely reason the example below prints 1 rather than 200.

In [11]:
# Default shuffle partition count
df4 = df.groupBy("id").count()
print(df4.rdd.getNumPartitions())

1
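
The default of 200 shuffle partitions can be far more than a small dataset needs. The plain-Python sketch below models hash partitioning of the 20 `id` values from the example into 200 buckets. It assumes Python's built-in `hash()` as a stand-in for Spark's own hash function, and `target_partition` is a hypothetical helper, so the exact bucket assignments will not match Spark's.

```python
NUM_SHUFFLE_PARTITIONS = 200  # mirrors the default of spark.sql.shuffle.partitions

def target_partition(key, num_partitions):
    # Stand-in for a hash partitioner; Spark does not use Python's hash()
    return hash(key) % num_partitions

# Group the 20 keys from df (ids 0..19) by the partition they would land in
buckets = {}
for key in range(0, 20):
    buckets.setdefault(target_partition(key, NUM_SHUFFLE_PARTITIONS), []).append(key)

non_empty = len(buckets)
print(f"non-empty partitions: {non_empty} of {NUM_SHUFFLE_PARTITIONS}")
```

With only 20 distinct keys, at most 20 of the 200 partitions can receive any data, so the rest stay empty. This is why lowering spark.sql.shuffle.partitions or reducing the partition count after the shuffle often makes sense for small inputs.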


After a shuffle operation, you can change the number of partitions using either coalesce() or repartition().