# Guide to Spark Partitioning: Dataset Aggregations


This notebook runs with spark.sql.shuffle.partitions set to 50.
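For reference, a setting like this can be applied at runtime through the session configuration. The sketch below is not taken from the original notebook; it assumes `spark` is the SparkSession that the notebook kernel provides.

```scala
// Assumption: `spark` is the SparkSession provided by the notebook kernel.
// Set the number of partitions Spark SQL uses when it shuffles data
// for aggregations and joins.
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Read the setting back to confirm it took effect.
println(spark.conf.get("spark.sql.shuffle.partitions"))
```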




In [4]:
case class MyRecord(key: Int, value: String)

In [3]:
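import org.apache.spark.sql.Dataset

// Builds a Dataset of numRecords records and spreads it over numPartitions
// partitions round-robin, i.e. without regard to the key.
def createPartionedDataset(name: String,
                           numRecords: Int,
                           numPartitions: Int)
        : Dataset[MyRecord] = {
    Range.inclusive(1, numRecords).map { value =>
        MyRecord(value, s"$name-value")
    }.toDS.repartition(numPartitions).localCheckpoint()
}

// Builds the same Dataset, but hash-partitions it on the key column.
def createPartionedDatasetOnKey(name: String,
                                numRecords: Int,
                                numPartitions: Int)
        : Dataset[MyRecord] = {
    Range.inclusive(1, numRecords).map { value =>
        MyRecord(value, s"$name-value")
    }.toDS.repartition(numPartitions, $"key").localCheckpoint()
}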

If the input Dataset (the one being aggregated) is already partitioned on all, or a subset, of the attributes of the aggregation key, the aggregated output Dataset has the same number of partitions as its parent, because no additional shuffle is needed. Such partitioning can come from an earlier repartition, aggregation, or join.



In [2]:
val a = createPartionedDatasetOnKey("a", 1000, 4)
val b = a.groupBy($"key").count()

println(a.queryExecution.executedPlan.outputPartitioning)

b.rdd.getNumPartitions

hashpartitioning(key#370, 4)


4
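The "subset" case can be checked the same way. The sketch below assumes the `createPartionedDatasetOnKey` helper and the `spark` session from the earlier cells; the names `c` and `d` are illustrative. It groups on both key and value while the input is hash-partitioned on key alone, so under the rule above the result should keep the 4 partitions of its parent.

```scala
import spark.implicits._  // for the $"col" syntax, if not already in scope

// Input is hash-partitioned on `key` only, into 4 partitions.
val c = createPartionedDatasetOnKey("c", 1000, 4)

// The aggregation key (key, value) is a superset of the partitioning key.
val d = c.groupBy($"key", $"value").count()

println(c.queryExecution.executedPlan.outputPartitioning)
println(d.rdd.getNumPartitions)
```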

If the input Dataset is not already partitioned on all, or a subset, of the attributes of the aggregation key, Spark shuffles the data, and the number of partitions in the aggregated output Dataset equals the value of the Spark config spark.sql.shuffle.partitions. That setting defaults to 200; this notebook sets it to 50. Where adaptive query execution is enabled, Spark may coalesce the shuffle output into fewer partitions.

In [6]:
val a = createPartionedDataset("a", 1000, 4)
val b = a.groupBy($"key").count()

println(a.queryExecution.executedPlan.outputPartitioning)

b.rdd.getNumPartitions

RoundRobinPartitioning(4)


50
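One way to see why the two cases differ is to print the physical plans. This is a sketch that assumes the helpers and the `spark` session from the earlier cells; `keyed` and `unkeyed` are illustrative names. The plan for `unkeyed` should contain an Exchange (shuffle) node between the partial and final aggregation, while the plan for `keyed` should not.

```scala
import spark.implicits._  // for the $"col" syntax, if not already in scope

// Aggregation over an input that is already hash-partitioned on the key.
val keyed = createPartionedDatasetOnKey("k", 1000, 4).groupBy($"key").count()

// Aggregation over an input that is partitioned round-robin.
val unkeyed = createPartionedDataset("u", 1000, 4).groupBy($"key").count()

// Print both physical plans for comparison.
keyed.explain()
unkeyed.explain()
```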