# Guide to Spark partitioning: Dataset coalesce / repartition

spark.sql.shuffle.partitions is set to 50.




In [1]:
import org.apache.spark.TaskContext
import org.apache.spark.sql.Dataset

In [2]:
case class MyRecord(key: Int, value: String)

In [3]:
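def createDataset(name: String,
                  numRecords: Int,
                  numPartitions: Int): Dataset[MyRecord] = {
    // Build the records locally, then spread them over numPartitions partitions.
    Range.inclusive(1, numRecords).map { value =>
        MyRecord(value, s"$name-value")
    }.toDS.repartition(numPartitions)
}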

Calling repartition with only partitioning expressions hash-partitions the Dataset ‘A’ on those expressions. Because no partition count is passed, the number of output partitions is taken implicitly from the config property ‘spark.sql.shuffle.partitions’.



In [7]:
createDataset("a", 1000, 100).repartition($"key").rdd.getNumPartitions

50
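The implicit partition count can be contrasted with an explicit one. The following is a minimal sketch, not part of the original notebook: the local SparkSession, the tuple-based DataFrame and the names `implicitCount` and `explicitCount` are assumptions made for illustration. It also assumes adaptive query execution is disabled, since on recent Spark versions it may coalesce small shuffle partitions and report a lower count.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("repartition-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: adaptive execution is turned off so the configured count is used as-is.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Hypothetical stand-in for the notebook's Dataset: 1000 (key, value) rows.
val df = (1 to 1000).map(i => (i, "a-value")).toDF("key", "value")

// No count given: the partition count comes from spark.sql.shuffle.partitions.
val implicitCount = df.repartition($"key").rdd.getNumPartitions

// Count given explicitly: it overrides the config property for this call.
val explicitCount = df.repartition(10, $"key").rdd.getNumPartitions

println(s"implicit: $implicitCount, explicit: $explicitCount")
```

Passing the count explicitly keeps the result independent of whatever value the session happens to have configured.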

Coalesce is a narrow transformation, so it shares its stage with the upstream transformations (on the upstream Datasets or RDDs) back to the nearest stage boundary in the RDD DAG of the Spark job. Because the tasks of that stage are created per coalesced partition, the parallelism of the whole stage drops to the number of partitions passed to coalesce. A drastic reduction with coalesce therefore also drastically reduces the parallelism of every upstream transformation that shares its stage. In the example below, the map reports only partition ids 0 and 1.



In [5]:
val a = createDataset("a", 1000, 100)

a.map(_ => TaskContext.getPartitionId()).coalesce(2).distinct.show()

+-----+
|value|
+-----+
|    0|
|    1|
+-----+
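For comparison, the same measurement can be made with repartition, which introduces a shuffle and therefore a stage boundary, so the upstream map keeps its original parallelism. This is a minimal sketch under stated assumptions: the local SparkSession, the use of `spark.range` with 100 partitions in place of the notebook's `createDataset`, and the names `viaCoalesce` and `viaRepartition` are all illustrative and not taken from the original.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in a notebook getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("coalesce-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input: 1000 rows spread evenly over 100 partitions.
val base = spark.range(0, 1000, 1, 100)

// coalesce(2) is narrow: the map runs in the same stage, in only 2 tasks,
// so only 2 distinct partition ids are observed.
val viaCoalesce = base
  .map(_ => TaskContext.getPartitionId())
  .coalesce(2)
  .distinct
  .count()

// repartition(2) shuffles: the map runs before the stage boundary,
// in one task per input partition, so 100 distinct ids are observed.
val viaRepartition = base
  .map(_ => TaskContext.getPartitionId())
  .repartition(2)
  .distinct
  .count()

println(s"coalesce: $viaCoalesce ids, repartition: $viaRepartition ids")
```

The trade-off is that repartition pays for a full shuffle, while coalesce avoids the shuffle at the cost of upstream parallelism.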

