# Guide to Spark Partitioning: RDD Union



In [1]:
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

In [2]:
spark.version

2.4.7

In [3]:
class CustomHashPartitioner(override val numPartitions: Int) extends Partitioner {
  // floorMod keeps the index in [0, numPartitions) even when hashCode is negative
  override def getPartition(key: Any): Int = Math.floorMod(key.hashCode, numPartitions)
}
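
A partition index has to fall in the range 0 until numPartitions, while the `%` operator on the JVM returns a negative remainder for a negative left operand. The following is a minimal sketch of a non-negative modulo, written without any Spark dependency. The helper name `nonNegativeMod` is an assumption made for illustration, not something defined in this notebook.

```scala
// Hypothetical helper: maps any hash code to an index in [0, numPartitions).
def nonNegativeMod(hash: Int, numPartitions: Int): Int =
  Math.floorMod(hash, numPartitions)

// The plain remainder keeps the sign of the dividend.
val rawRemainder = -7 % 5                          // -2

// floorMod takes the sign of the divisor, so the result is a valid index.
val safeIndex = nonNegativeMod(-7, 5)              // 3

// Even the most negative Int yields an index inside the valid range.
val extremeIndex = nonNegativeMod(Int.MinValue, 5)
```

With positive keys such as the `1 to numRecords` range produced by `createRdd`, both forms give the same index. The difference only matters for keys whose hash code is negative.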

In [4]:
def createRdd(name: String, numRecords: Int): RDD[(Int, String)] = {
    val data = Range.inclusive(1, numRecords).map { value =>
        value -> s"$name-value"
    }
    spark.sparkContext
        .parallelize(data)
}

If all input RDDs have the same Partitioner and the same number of partitions, the resulting RDD has the same number of partitions as each input RDD.



In [6]:
val partitioner = new CustomHashPartitioner(5)

val a = createRdd("a", 1000).partitionBy(partitioner)
val b = createRdd("b", 500).partitionBy(partitioner)
val c = a.union(b)

c.getNumPartitions

5
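
"The same Partitioner" is best read as "equal under `==`". The example above passes one shared instance to both `partitionBy` calls, so the two RDDs trivially agree. The sketch below uses stand-in classes with no Spark dependency to illustrate why two separately constructed instances may not count as the same unless the class defines value-based equality. `PlainPartitioner` and `ValuePartitioner` are hypothetical names introduced here, and the assumption that partitioners are compared through `equals`/`hashCode` should be checked against the Spark version in use.

```scala
// Hypothetical stand-in for a partitioner class that does not override equals.
class PlainPartitioner(val numPartitions: Int)

// Hypothetical stand-in that compares instances by their partition count.
class ValuePartitioner(val numPartitions: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: ValuePartitioner => that.numPartitions == numPartitions
    case _                      => false
  }
  override def hashCode: Int = numPartitions
}

// One shared instance: a set of partitioners collapses to a single element.
val shared = new PlainPartitioner(5)
val sharedCount = Set(shared, shared).size

// Two separate instances without equals: treated as two distinct partitioners.
val plainCount = Set(new PlainPartitioner(5), new PlainPartitioner(5)).size

// Two separate instances with value-based equals: treated as one.
val valueCount = Set(new ValuePartitioner(5), new ValuePartitioner(5)).size
```

Here `sharedCount` and `valueCount` are 1, while `plainCount` is 2. `CustomHashPartitioner` as defined in this notebook does not override `equals`, so reusing a single instance is the safe way to reproduce the result above.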

If the input RDDs differ in their number of partitions or have different Partitioners, the union transformation sums the partition counts
of all the parent RDDs to determine the number of partitions in the output RDD. This holds even when only one of the
input RDDs has a different Partitioner or a different number of partitions from the rest.

In [8]:
val partitioner = new CustomHashPartitioner(5)

val a = createRdd("a", 1000).partitionBy(partitioner)
// repartition(5) gives b 5 partitions, but b does not share a's Partitioner
val b = createRdd("b", 500).repartition(5)
val c = a.union(b)

c.getNumPartitions

10

If one or more input RDDs have no Partitioner, the union transformation likewise sums the partition counts of the
input RDDs to determine the number of partitions in the output RDD. This holds even when all the input RDDs have the same
number of partitions.

In [10]:
val partitioner = new CustomHashPartitioner(5)

val a = createRdd("a", 1000).partitionBy(partitioner)
// b has no Partitioner; its partition count comes from the default parallelism
val b = createRdd("b", 500)
val c = a.union(b)

c.getNumPartitions

13
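
The three cases can be summarised as one rule. The sketch below is a simplified model of that rule in plain Scala. It is not Spark's implementation, and `RddInfo` and `unionNumPartitions` are hypothetical names. Partitioners are represented by string labels, and the count of 8 for the unpartitioned RDD is an assumption chosen because 5 + 8 matches the 13 shown above. The real value depends on the default parallelism of the environment.

```scala
// Hypothetical stand-in for an RDD: only the details that matter for union.
final case class RddInfo(partitioner: Option[String], numPartitions: Int)

// Models the rule described in this guide: keep the partition count when every
// input shares one partitioner, otherwise add up the partition counts.
def unionNumPartitions(inputs: Seq[RddInfo]): Int = {
  val allPartitioned = inputs.forall(_.partitioner.isDefined)
  val distinctPartitioners = inputs.flatMap(_.partitioner).toSet
  if (inputs.nonEmpty && allPartitioned && distinctPartitioners.size == 1)
    inputs.head.numPartitions
  else
    inputs.map(_.numPartitions).sum
}

// Same partitioner on both inputs: the count is preserved.
val samePartitioner =
  unionNumPartitions(Seq(RddInfo(Some("hash5"), 5), RddInfo(Some("hash5"), 5)))

// Different partitioners with equal partition counts: the counts are summed.
val differentPartitioner =
  unionNumPartitions(Seq(RddInfo(Some("hash5"), 5), RddInfo(Some("other5"), 5)))

// One input without a partitioner (assumed to have 8 partitions): summed as well.
val missingPartitioner =
  unionNumPartitions(Seq(RddInfo(Some("hash5"), 5), RddInfo(None, 8)))
```

Under these assumptions the three values are 5, 10, and 13, matching the outputs of the cells above.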