# Guide to Spark partitioning: RDD aggregations 1


This notebook has spark.default.parallelism set to 8.




In [4]:
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

In [1]:
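// Builds a pair RDD of (Int, String) records and hash-partitions it by key,
// so the returned RDD carries an explicit HashPartitioner.
def createPartionedRddWithExplicitPartitioner(name: String,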
                                              numRecords: Int,
                                              numPartitions: Int)
        : RDD[(Int, String)] = {

    val data = Range.inclusive(1, numRecords).map { value =>
        value -> s"$name-value"
    }
    spark.sparkContext
        .parallelize(data)
        .partitionBy(new HashPartitioner(numPartitions))
}

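// Builds the same pair RDD but only redistributes it with repartition,
// so the returned RDD has numPartitions partitions and no partitioner.
def createPartionedRdd(name: String,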
                       numRecords: Int,
                       numPartitions: Int)
        : RDD[(Int, String)] = {
    val data = Range.inclusive(1, numRecords).map { value =>
        value -> s"$name-value"
    }
    spark.sparkContext
        .parallelize(data)
        .repartition(numPartitions)
}

If the input RDD has a partitioner on the aggregation key, the aggregated output RDD has the same number of partitions as the input RDD. The aggregation reuses the existing partitioner, so no shuffle is needed.



In [3]:
val a = createPartionedRddWithExplicitPartitioner("a", 1000, 4)
val b = a.reduceByKey((v1, v2) => v1)

b.getNumPartitions

4
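The reason the partition count carries over is that a hash partitioner already places every record with a given key in one partition. The sketch below models that placement in plain Scala, without Spark. `partitionFor` is a hypothetical helper that imitates the non-negative modulo of the key's hashCode used by HashPartitioner; it is a simplified stand-in, not Spark's own code.

```scala
// Simplified stand-in for HashPartitioner's key-to-partition mapping:
// a non-negative modulo of the key's hashCode (null keys go to partition 0).
def partitionFor(key: Any, numPartitions: Int): Int = {
  if (key == null) 0
  else {
    val rawMod = key.hashCode % numPartitions
    if (rawMod < 0) rawMod + numPartitions else rawMod
  }
}

val numPartitions = 4
val records = (1 to 1000).map(k => k -> "a-value")

// Group the records by the partition their key hashes to.
val layout = records.groupBy { case (k, _) => partitionFor(k, numPartitions) }

layout.toSeq.sortBy(_._1).foreach { case (p, rs) =>
  println(s"partition $p holds ${rs.size} records")
}
```

Because each key maps to exactly one partition, a per-key reduction can run inside each partition and keep the same 4-partition layout.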

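If the input RDD does not have a partitioner, the number of partitions in the aggregated output RDD equals the value of ‘spark.default.parallelism’ (8 in this notebook). If that property is not set, Spark falls back to the largest partition count among the input RDDs.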

In [6]:
val a = createPartionedRdd("a", 1000, 4)
val b = a.reduceByKey((v1, v2) => v1)

b.getNumPartitions

8
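Whether an RDD has a partitioner can be checked through its `partitioner` field, which is an `Option[Partitioner]`. The sketch below compares the two ways of getting 4 partitions used in this notebook. The names `session`, `pairs`, `hashed` and `shuffled` are illustrative, and the `local[2]` master is an assumption that only matters when no session is already running.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one (assumption).
val session = SparkSession.builder()
  .master("local[2]")
  .appName("partitioner-check")
  .getOrCreate()

val pairs = session.sparkContext.parallelize((1 to 1000).map(i => i -> "a-value"))

val hashed = pairs.partitionBy(new HashPartitioner(4)) // keeps the HashPartitioner
val shuffled = pairs.repartition(4)                    // redistributes, keeps no partitioner

println(hashed.partitioner.isDefined)   // true
println(shuffled.partitioner.isDefined) // false
println(hashed.getNumPartitions == shuffled.getNumPartitions) // true, both have 4
```

Both RDDs have 4 partitions, but only the first records how its keys were placed, which is why the two reduceByKey calls above produce different partition counts.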

Each of the three aggregation APIs also has an overload that takes the number of output partitions as an explicit argument.

In [7]:
val a = createPartionedRdd("a", 1000, 4)
val b = a.reduceByKey((v1, v2) => v1, 99)

b.getNumPartitions

99
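The same pattern can be sketched for other pair-RDD aggregations. The text above does not name the three APIs, so `groupByKey` and `aggregateByKey` are used here as assumed examples, together with the `reduceByKey` overload that takes a `Partitioner` instead of a number. The names `session` and `pairs`, the partition count 12 and the `local[2]` master are illustrative choices.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one (assumption).
val session = SparkSession.builder()
  .master("local[2]")
  .appName("explicit-partitions")
  .getOrCreate()

// A pair RDD with 4 partitions and no partitioner.
val pairs = session.sparkContext
  .parallelize((1 to 1000).map(i => i -> 1))
  .repartition(4)

// Overloads that take the number of output partitions.
val grouped = pairs.groupByKey(12)
val summed = pairs.aggregateByKey(0, 12)(_ + _, _ + _)

// Overload that takes a Partitioner instead of a number.
val reduced = pairs.reduceByKey(new HashPartitioner(12), (x: Int, y: Int) => x + y)

println(grouped.getNumPartitions) // 12
println(summed.getNumPartitions)  // 12
println(reduced.getNumPartitions) // 12
```

An explicit argument takes precedence over both the input partitioner and spark.default.parallelism, so it is the most predictable way to control the output layout.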