* `coalesce` and `repartition` are methods of the Data Frame. Do not confuse `coalesce` on a Data Frame with the `coalesce` function in `pyspark.sql.functions`, which is used to deal with null values in a given column.
* `coalesce` is typically used to **reduce the number of partitions** that downstream processing has to deal with.
* `repartition` is used to reshuffle the data into a **higher or lower number of partitions** for downstream processing.
* Make sure to use a cluster with a higher configuration if you would like to run these examples and experience the difference yourself.
  * 2 to 3 worker nodes using Standard with 14 to 16 GB RAM and 4 cores each.
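
To make the naming clash above concrete, here is a minimal pure-Python sketch of what the null-handling `coalesce` does to the values of a single row: it returns the first value that is not null. The helper name `first_non_null` is hypothetical and only models the behaviour; it is not part of Spark. The Data Frame method `coalesce`, by contrast, does not look at column values at all and only changes how many partitions hold the rows.

```python
def first_non_null(*values):
    """Toy model of the null-handling coalesce: return the first non-None value."""
    for value in values:
        if value is not None:
            return value
    return None  # every input was null


# One row with two candidate columns, where the first is null
print(first_non_null(None, 15))
# All candidates are null, so the result is null as well
print(first_non_null(None, None))
```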

In [0]:
df = spark.read.csv('dbfs:/databricks-datasets/asa/airlines', header=True)

In [0]:
help(df.coalesce)

In [0]:
help(df.repartition)

* `repartition` incurs **shuffling**, which takes time because the data has to be redistributed across the new set of partitions.
* You can also `repartition` the Data Frame based on specified columns.
* `coalesce` does not incur a full shuffle. It merges existing partitions into fewer ones.
* We often use `coalesce` just before writing, so that the data is written to fewer files.
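
The difference described above can be sketched with a toy model in plain Python, where a list of lists stands in for the partitions. This is only an illustration under simplifying assumptions: the function names `toy_coalesce` and `toy_repartition` are hypothetical, the grouping rule in `toy_coalesce` is a simple contiguous grouping, and Spark's own placement logic is more involved.

```python
def toy_coalesce(partitions, n):
    """Merge whole existing partitions into at most n groups (no row-level movement)."""
    if n >= len(partitions):
        # The partition count can only be reduced, so nothing changes here
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous grouping: neighbouring partitions end up in the same bucket
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every row individually (round-robin) into n new partitions."""
    out = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for row in part:
            out[k % n].append(row)
            k += 1
    return out


parts = [[1, 2, 3, 4, 5, 6], [7], [8, 9], [10]]  # skewed input partitions
coalesced = toy_coalesce(parts, 2)
repartitioned = toy_repartition(parts, 2)

print([len(p) for p in coalesced])      # sizes inherit the input skew
print([len(p) for p in repartitioned])  # sizes are evened out
print(len(toy_coalesce(parts, 10)))     # asking for more partitions has no effect
```

In this model the merged partitions keep whatever skew the input had, while the row-by-row redistribution evens the sizes out at the cost of touching every row. That is the trade-off to keep in mind when choosing between the two.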

In [0]:
df = spark.read.csv('dbfs:/databricks-datasets/asa/airlines', header=True, inferSchema=True)

In [0]:
dbutils.fs.ls('dbfs:/databricks-datasets/asa/airlines')

In [0]:
df.rdd.getNumPartitions()

In [0]:
# coalescing the Dataframe to 16
df.coalesce(16).rdd.getNumPartitions()

In [0]:
# Not effective: coalesce can only reduce the number of partitions,
# so a target above the current count leaves the partitioning unchanged.
# Still fast, as no shuffling is involved
df.coalesce(186).rdd.getNumPartitions()

In [0]:
# incurs shuffling
# Watch the execution time and compare with coalesce
df.repartition(16).rdd.getNumPartitions()

In [0]:
# repartition to a higher number of partitions, distributing rows by Year and Month
df.repartition(186, 'Year', 'Month').rdd.getNumPartitions()
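
When columns are passed to `repartition`, rows are assigned to partitions based on a hash of those column values, so rows with the same key land in the same partition. The sketch below models that idea in plain Python. It assumes Python's built-in `hash` as a stand-in for Spark's own hash function, and both the helper `toy_repartition_by` and the sample rows (including the `FlightNum` values) are made up for illustration.

```python
def toy_repartition_by(rows, n, *cols):
    """Place each row in partition hash(key) % n, where key is the tuple of column values."""
    out = [[] for _ in range(n)]
    for row in rows:
        key = tuple(row[c] for c in cols)
        out[hash(key) % n].append(row)
    return out


# Hypothetical sample: 2 years x 3 months x 4 flights = 24 rows, 6 distinct keys
rows = [
    {"Year": y, "Month": m, "FlightNum": f}
    for y in (1987, 1988)
    for m in (1, 2, 3)
    for f in range(4)
]

parts = toy_repartition_by(rows, 8, "Year", "Month")
non_empty = [p for p in parts if p]

print(len(parts))      # the requested number of partitions
print(len(non_empty))  # cannot exceed the number of distinct (Year, Month) keys
```

The takeaway is that the partition count reported after repartitioning by columns is the number requested, but some of those partitions may hold no rows when there are fewer distinct keys than partitions.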

A common use of `coalesce` is to write the data to fewer files.

In [0]:
import getpass
username = getpass.getuser()

In [0]:

dbutils.fs.rm(f'/user/{username}/airlines', recurse=True)

In [0]:
df.write.mode('overwrite').csv(f'/user/{username}/airlines', header=True, compression='gzip')

In [0]:
dbutils.fs.ls(f'/user/{username}/airlines')

In [0]:
df.repartition(16).write.mode('overwrite').csv(f'/user/{username}/airlines', header=True, compression='gzip')

In [0]:
dbutils.fs.ls(f'/user/{username}/airlines')

In [0]:
# Using repartition(16) here instead would take longer, because it shuffles the data.
df.coalesce(16).write.mode('overwrite').csv(f'/user/{username}/airlines', header=True, compression='gzip')

In [0]:
dbutils.fs.ls(f'/user/{username}/airlines')
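
To compare the three writes above, it helps to count only the data files in each listing, since the output folder also contains marker files. The helper below is a small sketch: `count_part_files` is a hypothetical name, the file names in `listing` are made up, and it assumes data files start with `part-`, which is the usual Spark naming.

```python
def count_part_files(names):
    """Count data files in a Spark output folder listing, ignoring marker files."""
    return sum(1 for name in names if name.startswith("part-"))


# Made-up listing that mimics the names found in an output folder
listing = [
    "_SUCCESS",
    "_committed_123",
    "_started_123",
    "part-00000-tid-1.csv.gz",
    "part-00001-tid-1.csv.gz",
]

print(count_part_files(listing))
```

In the notebook, this could be called as `count_part_files([f.name for f in dbutils.fs.ls(f'/user/{username}/airlines')])`. After the `repartition(16)` and `coalesce(16)` writes, the expectation is one data file per non-empty partition.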