# Guide to Spark partitioning: RDD Joins 1 / 2


This notebook has spark.default.parallelism set to 8




In [1]:
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

In [2]:
spark.version

2.4.7

In [3]:
spark.sparkContext.getConf.getAll.foreach { case (key, value) => 
    println(s"$key: $value") 
}

spark.driver.host: 127.0.0.1
spark.home: /home/jkuperus/Tools/spark-2.4.7-bin-hadoop2.6
spark.app.id: local-1604354386984
spark.executor.id: driver
spark.driver.port: 37821
spark.jars: /home/jkuperus/dev/tools/polynote/deps/polynote-spark-runtime.jar,/home/jkuperus/dev/tools/polynote/deps/polynote-spark-runtime.jar,/home/jkuperus/dev/tools/polynote/deps/polynote-spark-runtime.jar,https://repo1.maven.org/maven2/com/github/jelmerk/hnswlib-utils/0.0.46/hnswlib-utils-0.0.46.jar,https://repo1.maven.org/maven2/com/github/jelmerk/hnswlib-scala_2.11/0.0.46/hnswlib-scala_2.11-0.0.46.jar,https://repo1.maven.org/maven2/org/eclipse/collections/eclipse-collections-api/9.2.0/eclipse-collections-api-9.2.0.jar,https://repo1.maven.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.4/spark-avro_2.11-2.4.4.jar,https://repo1.maven.org/maven2/com/github/jelmerk/hnswlib-spark_2.3.0_2.11/0.0.46/hnswlib-spark_2.3.0_2.11-0.0.46.j

In [4]:
class CustomHashPartitioner(override val numPartitions: Int) extends Partitioner {
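  // Non-negative modulo: hashCode (and therefore %) can be negative, but a partition index cannot.
  override def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions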
}
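
A partitioner's `getPartition` is expected to return an index between 0 and `numPartitions - 1`. Scala's `%` keeps the sign of the dividend, so a key with a negative hash code gives a negative remainder. The keys in this notebook are positive integers, so the problem does not show up here, but the sketch below illustrates the difference. `nonNegativeMod` is a hypothetical helper name used only for this illustration.

```scala
// Hypothetical helper: maps any Int hash code into the range [0, mod).
def nonNegativeMod(hash: Int, mod: Int): Int = {
  val raw = hash % mod              // negative whenever hash is negative and not a multiple of mod
  if (raw < 0) raw + mod else raw
}

val plainRemainder = -7 % 4                 // -3: not a valid partition index
val safeRemainder = nonNegativeMod(-7, 4)   // 1: valid for 4 partitions
```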

In [5]:
def createPartionedRddWithExplicitPartitioner(name: String,
                                              numRecords: Int,
                                              numPartitions: Int)
        : RDD[(Int, String)] = {
    val data = Range.inclusive(1, numRecords).map { value =>
        value -> s"$name-value"
    }
    spark.sparkContext
        .parallelize(data)
        .partitionBy(new CustomHashPartitioner(numPartitions))
}

In [6]:
def createPartionedRdd(name: String,
                       numRecords: Int,
                       numPartitions: Int)
        : RDD[(Int, String)] = {
    val data = Range.inclusive(1, numRecords).map { value =>
        value -> s"$name-value"
    }
    spark.sparkContext
        .parallelize(data)
        .repartition(numPartitions)
}

## RDDs



**1)** If no input pair RDD in the join has a partitioner on its key, the number of partitions in the joined RDD equals the value of the config property 'spark.default.parallelism'.




In [9]:
val a = createPartionedRdd("a", 1000, 6)
val b = createPartionedRdd("b", 500, 2)

a.join(b).getNumPartitions

10
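
Rule 1 can be modelled in plain Scala without a Spark session. This is a simplified sketch, not Spark's own code. It assumes that when 'spark.default.parallelism' is not set at all, the largest partition count among the inputs is used instead. The function name and the sample values are illustrative and are not taken from this notebook.

```scala
// Simplified model: partition count of a join when no input RDD has a partitioner.
// configuredParallelism is None when 'spark.default.parallelism' is not set (assumption).
def defaultNumPartitions(configuredParallelism: Option[Int],
                         inputPartitionCounts: Seq[Int]): Int =
  configuredParallelism.getOrElse(inputPartitionCounts.max)

val withConfig = defaultNumPartitions(Some(4), Seq(6, 2))    // 4: the configured value wins
val withoutConfig = defaultNumPartitions(None, Seq(6, 2))    // 6: falls back to the largest input
```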

**2)** If one or both input RDDs have a partitioner on the key, the largest partition count among the inputs that carry a partitioner (the maximum value) is compared with the config property 'spark.default.parallelism'.

**2a)** If the config property is set and the maximum value is above it, the maximum value is chosen as the number of partitions in the joined RDD.



In [11]:
val a = createPartionedRddWithExplicitPartitioner("a", 1000, 20)
val b = createPartionedRdd("b", 500, 2)

println(a.join(b).getNumPartitions)
println(a.join(b).partitioner == a.partitioner)

20
true


**2b)** If the config property is set and the maximum value is below it, but the maximum value is within a single order of magnitude of the highest number of partitions among all input RDDs, the maximum value is chosen as the number of partitions in the joined RDD.




In [13]:
val a = createPartionedRddWithExplicitPartitioner("a", 1000, 6)
val b = createPartionedRdd("b", 500, 2)

a.join(b).getNumPartitions

6
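
The "single order of magnitude" test in rule 2b can be sketched as a difference of base-10 logarithms. This is a model based on my reading of Spark 2.4's default partitioner logic, not a copy of it. It assumes the partitioner's partition count is compared with the largest partition count among all inputs. The function name is hypothetical.

```scala
import scala.math.log10

// Assumed check: the partitioner's count is less than a factor of 10 below the largest input.
def isWithinOneOrderOfMagnitude(partitionerPartitions: Int, maxInputPartitions: Int): Boolean =
  log10(maxInputPartitions.toDouble) - log10(partitionerPartitions.toDouble) < 1

val sameSize = isWithinOneOrderOfMagnitude(6, 6)      // true: the 2b example (6 is also the largest input)
val muchSmaller = isWithinOneOrderOfMagnitude(1, 20)  // false: 20 is more than 10 times 1
```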

**2c)** If the config property is set and the maximum value is below it, and the maximum value is not within a single order of magnitude of the highest number of partitions among the input RDDs, the config property value is chosen as the number of partitions in the joined RDD.


**This seems to be false: the behaviour does not occur on Spark 2.3.2 or Spark 2.4.7.**


*Note: in older versions of Spark, if one or both parent pair RDDs have a partitioner on the key, the maximum number of partitions among the partitioner-carrying input RDDs is always chosen as the number of partitions for the joined RDD, irrespective of any value set for the config property 'spark.default.parallelism'.*




In [15]:
val a = createPartionedRddWithExplicitPartitioner("a", 1000, 1)
val b = createPartionedRddWithExplicitPartitioner("b", 500, 1)

a.join(b).getNumPartitions

1
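
One possible reading of the result in In [15] is that the example never meets the 2c condition: both inputs carry a partitioner with 1 partition, so the largest input also has 1 partition and the order-of-magnitude test passes. The sketch below combines rules 1 and 2 into a single plain-Scala model to make that visible. It is an assumption-laden paraphrase of Spark 2.4's default partitioner selection, not its actual code. `JoinInput` and `joinedPartitionCount` are hypothetical names, and the second call is a made-up scenario that was not run in this notebook.

```scala
import scala.math.log10

// Hypothetical description of one join input.
final case class JoinInput(numPartitions: Int, hasPartitioner: Boolean)

// Simplified model of the partition count chosen for a join.
def joinedPartitionCount(configuredParallelism: Option[Int], inputs: Seq[JoinInput]): Int = {
  val maxInputPartitions = inputs.map(_.numPartitions).max
  val defaultCount = configuredParallelism.getOrElse(maxInputPartitions)
  val partitionedCounts = inputs.filter(_.hasPartitioner).map(_.numPartitions)
  if (partitionedCounts.isEmpty) {
    defaultCount                                   // rule 1: no partitioner anywhere
  } else {
    val largest = partitionedCounts.max            // the "maximum value" of rule 2
    val eligible =
      log10(maxInputPartitions.toDouble) - log10(largest.toDouble) < 1
    if (eligible || defaultCount < largest) largest  // rules 2a and 2b
    else defaultCount                                // rule 2c
  }
}

// In [15]: both inputs partitioned into 1 partition, parallelism assumed to be 8.
val bothSingle = joinedPartitionCount(Some(8), Seq(JoinInput(1, true), JoinInput(1, true)))  // 1

// Made-up scenario: a 1-partition partitioner next to a 20-partition input without one.
val mixed = joinedPartitionCount(Some(8), Seq(JoinInput(1, true), JoinInput(20, false)))     // 8
```

Under this model, rule 2c would only apply when an input without a partitioner has at least ten times as many partitions as the largest partitioned input.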