# Guide to Spark Partitioning: Dataset Output






In [1]:
import java.nio.file.Files
import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.sql.{Dataset, SaveMode}
import sys.process._

In [2]:
val categories = Array("fruit", "cars", "animals")

case class MyRecord(key: Int, category: String, value: String)

In [3]:
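// Builds a Dataset of `numRecords` records with a random category each,
// spread over `numPartitions` in-memory partitions.
def createPartionedDataset(name: String,
                           numRecords: Int,
                           numPartitions: Int)
        : Dataset[MyRecord] = {

    Range.inclusive(1, numRecords).map { value =>

        val randomCategory = categories(ThreadLocalRandom.current.nextInt(categories.size))
        MyRecord(value, randomCategory, s"$name-value")
    // localCheckpoint() eagerly materializes the Dataset and truncates its lineage.
    }.toDS.repartition(numPartitions).localCheckpoint()
}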

In [4]:
val tmpDir = Files.createTempDirectory("spark").toFile.getCanonicalPath

The most basic way for Spark to write an RDD/Dataset to a filesystem is to write each partition of the
RDD/Dataset to a separate part file. All part files belonging to the same RDD/Dataset are written to a common directory.

In [6]:
createPartionedDataset("a", 1000, 4)
    .write.mode(SaveMode.Overwrite).parquet(tmpDir)


In [7]:
s"ls -l $tmpDir".!

total 16
-rw-r--r-- 1 jkuperus jkuperus 2009 Nov  6 09:13 part-00000-5ad010fb-a023-4e2a-8da6-fce7d2c7332d-c000.snappy.parquet
-rw-r--r-- 1 jkuperus jkuperus 2009 Nov  6 09:13 part-00001-5ad010fb-a023-4e2a-8da6-fce7d2c7332d-c000.snappy.parquet
-rw-r--r-- 1 jkuperus jkuperus 2009 Nov  6 09:13 part-00002-5ad010fb-a023-4e2a-8da6-fce7d2c7332d-c000.snappy.parquet
-rw-r--r-- 1 jkuperus jkuperus 2043 Nov  6 09:13 part-00003-5ad010fb-a023-4e2a-8da6-fce7d2c7332d-c000.snappy.parquet
-rw-r--r-- 1 jkuperus jkuperus    0 Nov  6 09:13 _SUCCESS


0

In the ‘partitionBy’ approach, which applies only to Datasets, an additional ‘partitionBy’ expression is specified in the writer API. Usually, the expression consists of one or more data fields of the Dataset schema. While the records of a partition are being written, a sub-directory is first identified for each record from the value of the ‘partitionBy’ expression evaluated for that record (for example, category=fruit). The sub-directory lives inside the primary directory given for the Dataset in the writer API; if it does not exist yet, it is created first. The record is then written to the file for its partition inside that sub-directory.

The ‘partitionBy’ approach helps when you read the written data back with a filter on the ‘partitionBy’ expression, because Spark can then read only the sub-directories that match the filter and skip the others.



In [9]:
createPartionedDataset("a", 1000, 4)
    .write.mode(SaveMode.Overwrite).partitionBy("category").parquet(tmpDir)

In [10]:
s"find $tmpDir -name *.parquet".!

/tmp/spark6577489937337534733/category=animals/part-00003-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=animals/part-00002-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=animals/part-00001-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=animals/part-00000-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=fruit/part-00003-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=fruit/part-00002-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=fruit/part-00001-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=fruit/part-00000-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b199.c000.snappy.parquet
/tmp/spark6577489937337534733/category=cars/part-00003-6ae9ddb2-cfa8-4bdf-80b4-92bd61c9b

0
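
The directory layout produced by ‘partitionBy’ can be modelled without Spark. The sketch below is only an illustration under stated assumptions: `relativePath` is a hypothetical helper, the records are generated deterministically instead of randomly, and the file names are simplified (real names also carry a job UUID and a suffix such as `.c000.snappy.parquet`).

```scala
object PartitionByLayoutSketch {
  // Mirrors the notebook's record type so the sketch stands alone.
  case class MyRecord(key: Int, category: String, value: String)

  // Hypothetical helper: the relative path a record lands in, given the
  // index of the in-memory partition (write task) that holds the record.
  def relativePath(record: MyRecord, partitionIndex: Int): String =
    f"category=${record.category}/part-$partitionIndex%05d"

  val categories = Vector("fruit", "cars", "animals")
  val numPartitions = 4

  // Deterministic stand-in for the notebook's random data.
  val records = (1 to 1000).map { i =>
    MyRecord(i, categories(i % categories.size), "a-value")
  }

  // Assumption: record i sits in partition i % numPartitions.
  val paths: Set[String] =
    records.map(r => relativePath(r, r.key % numPartitions)).toSet
}
```

With 4 partitions and 3 categories, each partition can contribute one file to each category directory, so the model yields up to 4 × 3 = 12 distinct paths. This is why a high-cardinality ‘partitionBy’ column combined with many partitions can produce a large number of small files.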

In [11]:
val df = spark.read.parquet(tmpDir)

println(df.rdd.getNumPartitions)
println(df.queryExecution.executedPlan.outputPartitioning)

6
UnknownPartitioning(0)
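
The read-side benefit of ‘partitionBy’ mentioned earlier can be pictured with a small model of directory pruning. This is a conceptual sketch only, not Spark's implementation: `parse` and `prune` are hypothetical helpers, and the directory names are taken from the layout shown above. In Spark itself, the equivalent would be a filter on the `category` column of the Dataset read from `tmpDir`.

```scala
object PartitionPruningSketch {
  // Sub-directories as produced by partitionBy("category").
  val directories = Seq("category=animals", "category=cars", "category=fruit")

  // Split a "column=value" directory name into its two parts.
  def parse(dir: String): (String, String) = {
    val idx = dir.indexOf('=')
    (dir.take(idx), dir.drop(idx + 1))
  }

  // Keep only the directories whose encoded value satisfies the predicate;
  // the files in all other directories never need to be opened.
  def prune(dirs: Seq[String], column: String)(keep: String => Boolean): Seq[String] =
    dirs.filter { dir =>
      val (c, v) = parse(dir)
      c == column && keep(v)
    }

  // A filter such as category == "fruit" selects a single directory.
  val toRead: Seq[String] = prune(directories, "category")(_ == "fruit")
}
```

Because the value of the partition column is encoded in the directory name, the decision about which files to skip can be made from the file listing alone, before any Parquet file is read.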


In the ‘bucketBy’ approach, which also applies only to Datasets, an additional ‘bucketBy’ expression is specified in the writer API together with the number of buckets. Usually, the expression consists of one or more data fields of the Dataset schema. While the records of a partition are being written, a bucket is first computed for each record, and the record is then written to a file specific to that partition and bucket. Therefore, if the records of a partition map to every available bucket, the number of files written to the primary directory for that partition equals the number of buckets. To store the bucketing spec as well, a table name is also specified in the writer API. The spec is then kept in the table's metadata, from which it can be retrieved during the read operation.
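
The bucket computation can be sketched in plain Scala. This is an approximation under stated assumptions: `bucketId` and `fileName` are hypothetical helpers, and `scala.util.hashing.MurmurHash3` is used as a stand-in hash. Spark applies its own Murmur3-based hash expression, so the bucket ids it assigns may differ from the ones computed here, and its file names carry additional parts.

```scala
import scala.util.hashing.MurmurHash3

object BucketBySketch {
  val numBuckets = 3
  val numPartitions = 4

  // Stand-in for Spark's bucket hash: a non-negative hash modulo the bucket count.
  def bucketId(category: String): Int =
    Math.floorMod(MurmurHash3.stringHash(category), numBuckets)

  // Simplified file name: <partition index>_<bucket id>.
  def fileName(partitionIndex: Int, category: String): String =
    f"part-$partitionIndex%05d_${bucketId(category)}%05d"

  val categories = Vector("fruit", "cars", "animals")

  // Every partition can emit at most one file per bucket.
  val files: Seq[String] =
    (for {
      p <- 0 until numPartitions
      c <- categories
    } yield fileName(p, c)).distinct
}
```

Records with the same value of the bucketing column always land in the same bucket, regardless of which partition holds them. The number of files is bounded by the number of partitions times the number of buckets (here 4 × 3 = 12), and it can be lower when several values hash to the same bucket.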



In [13]:
createPartionedDataset("a", 1000, 4)
    .write.mode(SaveMode.Overwrite).option("path", tmpDir).bucketBy(3, "category").saveAsTable("records")

In [14]:
s"find $tmpDir -name *.parquet".!

/tmp/spark6577489937337534733/part-00003-34cb36a6-04b0-419e-a67a-38da5b5f962a_00000.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00003-34cb36a6-04b0-419e-a67a-38da5b5f962a_00001.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00001-34cb36a6-04b0-419e-a67a-38da5b5f962a_00000.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00000-34cb36a6-04b0-419e-a67a-38da5b5f962a_00002.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00002-34cb36a6-04b0-419e-a67a-38da5b5f962a_00002.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00001-34cb36a6-04b0-419e-a67a-38da5b5f962a_00001.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00002-34cb36a6-04b0-419e-a67a-38da5b5f962a_00001.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00003-34cb36a6-04b0-419e-a67a-38da5b5f962a_00002.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00001-34cb36a6-04b0-419e-a67a-38da5b5f962a_00002.c000.snappy.parquet
/tmp/spark6577489937337534733/part-00000-34cb36a6-04b0-419e-a67a

0

In [15]:
val df = spark.read.table("records")

println(df.rdd.getNumPartitions)
println(df.queryExecution.executedPlan.outputPartitioning)

3
hashpartitioning(category#63, 3)
