- Date: 2020-09-26 09:01:58
- Title: Partition and Bucketing in Spark
- Slug: partition-bucketing-in-spark
- Category: Computer Science
- Tags: Computer Science, Spark, big data, bucket, partition
- Modified: 2021-01-26 09:01:58


## Tips and Traps

1. Bucketed columns are only supported in Hive tables at this time.

2. A Hive table can have both partition and bucket columns.

2. Suppose `t1` and `t2` are 2 bucketed tables with `b1` and `b2` buckets respectively.
    For bucket optimization to kick in when joining them:

        - The 2 tables must be bucketed on the same keys/columns.
        - The join must be on the bucket keys/columns.
        - `b1` is a multiple of `b2` or `b2` is a multiple of `b1`.
        
    When there are many bucketed tables that might join with each other,
    the number of buckets needs to be carefully designed so that an efficient bucket join can always be leveraged.

2. Bucketing for optimized filtering is available in Spark 2.4+.
    For example,
    if the table `person` has a bucketed column `id` with an integer-compatible type,
    then the following query in Spark 2.4+ will be optimized to avoid a scan of the whole table.
    There are a few things to be aware of here.
    First,
    you will still see a number of tasks close to the number of buckets in your Spark application.
    This is because the optimized job still has to check all buckets of the table
    to see whether they are the right bucket corresponding to `id = 123`.
    (If yes, Spark will scan all rows in the bucket to filter records.
    If not, the bucket will be skipped to save time.)
    Second,
    the type of the value to compare must be compatible in order for Spark SQL to leverage bucket filtering.
    For example,
    if the `id` column in the `person` table is of the BigInt type
    and `id = 123` is changed to `id = "123"` in the following query,
    Spark will have to do a full table scan (even if it sounds extremely stupid to do so).

        :::sql
        SELECT *
        FROM person
        WHERE id = 123
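
The three bucket join conditions listed in the tips above can be written down as a small predicate, which is handy when planning bucket counts for several tables at once. This is only a sketch of the rule of thumb, not Spark's planner logic; `BucketLayout` and `can_use_bucket_join` are hypothetical names introduced here, and bucket counts are assumed to be positive.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BucketLayout:
    """Hypothetical record of how a table is bucketed (not a Spark class)."""
    columns: Tuple[str, ...]
    num_buckets: int


def can_use_bucket_join(t1: BucketLayout, t2: BucketLayout, join_keys: Sequence[str]) -> bool:
    """Return True if the three conditions from the tips all hold."""
    # Condition 1: both tables are bucketed on the same columns.
    same_columns = t1.columns == t2.columns
    # Condition 2: the join is on the bucket columns.
    joins_on_bucket_columns = set(join_keys) == set(t1.columns)
    # Condition 3: one bucket count is a multiple of the other.
    larger = max(t1.num_buckets, t2.num_buckets)
    smaller = min(t1.num_buckets, t2.num_buckets)
    counts_are_multiples = larger % smaller == 0
    return same_columns and joins_on_bucket_columns and counts_are_multiples


# Made-up layouts for illustration only.
orders = BucketLayout(columns=("id",), num_buckets=8)
users = BucketLayout(columns=("id",), num_buckets=4)
events = BucketLayout(columns=("id",), num_buckets=10)

print(can_use_bucket_join(orders, users, ["id"]))  # 8 is a multiple of 4
print(can_use_bucket_join(events, users, ["id"]))  # 10 and 4 are not multiples of each other
```

Picking bucket counts from a single family (for example powers of 2) is one way to keep every pair of tables compatible under the third condition.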
        

## Tricks and Traps on `DataFrame.write.partitionBy` and `DataFrame.write.bucketBy`

Partitioning is an important concept in Spark
which affects Spark performance in many ways.
When reading a table into Spark,
the number of partitions in memory equals the number of files on disk if each file is smaller than the block size;
otherwise,
there will be more partitions in memory than files on disk.
Generally speaking,
there shouldn't be too many small files in a table as this causes too many partitions (and thus small tasks) in the Spark job.
When you write a Spark DataFrame to disk,
the number of files on disk usually equals the number of partitions in memory
unless you use `partitionBy` or `bucketBy`.
Suppose there is a DataFrame `df` which has `p` partitions in memory
and a column named `col` which has `c` distinct values $v_1$, ..., $v_c$.
When you write `df` to disk using `df.write.partitionBy(col)`,
each of the `p` partitions in memory writes a separate file into each of the `c` directories on disk for which it holds rows.
This means that the resulting number of files can be up to $c * p$.
This is probably not what people want in most situations.
Instead,
people often want exactly $c$ files on disk (one per directory) when they call `df.write.partitionBy(col)`.
Given the above explanation of how `DataFrame.write.partitionBy` works,
a simple fix is to have each partition in memory correspond to a distinct value in the column `df.col`.
That is, repartitioning the DataFrame by the column `col` resolves the issue.

    :::python
    df.repartition(col).write.partitionBy(col)
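
The file-count argument can be checked with a toy model that does not need Spark at all. The sketch below assumes one output file per (in-memory partition, distinct value) pair, as described above; `count_output_files` and `repartition_by` are hypothetical helpers, and Python's built-in `hash` stands in for Spark's partitioner.

```python
def count_output_files(partitions, key):
    """One file is written per (in-memory partition, distinct key value) pair."""
    return len({(pid, row[key]) for pid, rows in enumerate(partitions) for row in rows})


def repartition_by(partitions, key, num_partitions):
    """Toy stand-in for df.repartition(col): equal keys go to the same partition."""
    result = [[] for _ in range(num_partitions)]
    for rows in partitions:
        for row in rows:
            result[hash(row[key]) % num_partitions].append(row)
    return result


# p = 2 in-memory partitions, c = 2 distinct values of "year"
partitions = [
    [{"year": 2018, "x": 1}, {"year": 2019, "x": 2}],
    [{"year": 2018, "x": 3}, {"year": 2019, "x": 4}],
]

files_before = count_output_files(partitions, "year")  # c * p = 4 files
files_after = count_output_files(repartition_by(partitions, "year", 200), "year")  # c = 2 files
print(files_before, files_after)  # prints: 4 2
```

The partition count of 200 in the toy repartition is arbitrary. What matters is that after repartitioning by the key, each distinct value lives in exactly one partition.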

The above issue is not present when you use `DataFrame.write.bucketBy`,
as `DataFrame.write.bucketBy` assigns rows to buckets by calculating hash codes of the bucket columns.
There will always be exactly the number of buckets on disk
that you specified when calling `DataFrame.write.bucketBy`.

In [1]:
import findspark
findspark.init("/opt/spark-3.0.0-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Union") \
    .enableHiveSupport().getOrCreate()

In [3]:
df = spark.read.option("header", "true").csv("../../home/media/data/daily.csv")
df = df.select(
    year("date").alias("year"),
    month("date").alias("month"), "date", "x", "y", "z"
).repartition(2)
df.show()

+----+-----+----------+------------------+------------------+------------------+
|year|month|      date|                 x|                 y|                 z|
+----+-----+----------+------------------+------------------+------------------+
|2018|   12|2018-12-12|          15218.66|343419.90721800004|136.56000000000003|
|2018|   12|2018-12-14|12127.650000000005|     252696.129202|125.28000000000002|
|2018|   12|2018-12-05| 35484.22999999998|     442708.934149|            230.76|
|2018|   10|2018-10-28|28418.420000000016|     515499.609327|268.80000000000007|
|2019|    1|2019-01-07|          29843.17|     375139.756514|172.62000000000003|
|2019|    1|2019-01-09|          30132.28|     212952.094433|            128.52|
|2018|   11|2018-11-22| 38395.96999999998|     437842.863362|            237.12|
|2018|   11|2018-11-23|          38317.15|391639.59950300003|212.22000000000003|
|2018|   12|2018-12-30| 7722.129999999999|     210282.286054| 85.80000000000003|
|2018|   10|2018-10-17|11101

In [4]:
df.rdd.getNumPartitions()

2

In [8]:
df.write.mode("overwrite").partitionBy("year").parquet("part_by_year.parquet")

In [11]:
!ls part_by_year.parquet/

_SUCCESS  year=2018  year=2019


Spark supports multiple levels of partitioning.

In [9]:
df.write.mode("overwrite").partitionBy("year",
                                       "month").parquet("part_by_year_month.parquet")

In [10]:
!ls part_by_year_month.parquet/year=2018/

[1m[36mmonth=10[m[m [1m[36mmonth=11[m[m [1m[36mmonth=12[m[m


In [24]:
spark.read.parquet("daily.parquet").rdd.getNumPartitions()

4

In [17]:
df.repartition("year").write.mode("overwrite").partitionBy("year"
                                                          ).parquet("daily.parquet")

In [18]:
!ls daily.parquet/year=2018

part-00015-76ce0363-393a-4e1a-a387-488170fdcfbf.c000.snappy.parquet


In [19]:
!ls daily.parquet/year=2019

part-00081-76ce0363-393a-4e1a-a387-488170fdcfbf.c000.snappy.parquet


In [25]:
spark.read.parquet("daily.parquet").rdd.getNumPartitions()

4

In [26]:
df.write.mode("overwrite").partitionBy("year").saveAsTable("daily_hive")

In [28]:
spark.table("daily_hive").rdd.getNumPartitions()

4

In [29]:
df.createOrReplaceTempView("df")

In [35]:
spark.sql(
    """
    create table daily_hive_2
    using parquet     
    partitioned by (year) as
    select * from df
    """
)

DataFrame[]

In [37]:
spark.table("daily_hive_2").rdd.getNumPartitions()

4

## Filtering Optimization Leveraging Bucketed Columns

### Spark 3

In [1]:
import findspark
findspark.init("/opt/spark-3.0.0-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Union") \
    .enableHiveSupport().getOrCreate()

In [2]:
df = spark.read.option("header", "true").csv("../../home/media/data/daily.csv")
df = df.repartition(2)
df.show()

+----------+------------------+------------------+------------------+
|      date|                 x|                 y|                 z|
+----------+------------------+------------------+------------------+
|2018-10-22|           10779.9|234750.19368899995|150.78000000000003|
|2018-12-07|15637.329999999998|281424.52784600004|147.36000000000004|
|2018-12-21|           4797.22|106753.64014699995|             47.46|
|2018-10-17|11101.180000000006|243019.40156300002|            150.84|
|2018-11-09|           25519.6|     287836.930741|184.01999999999995|
|2018-11-28|           39134.8|446640.72524799994|             225.3|
|2018-12-14|12127.650000000005|     252696.129202|125.28000000000002|
|2018-12-09|          14820.05|     407420.724814|167.81999999999996|
|2018-11-27|38929.669999999984|441879.99280600005|244.50000000000009|
|2018-12-18|           7623.48|     189779.703736| 90.05999999999996|
|2018-12-20| 5015.930000000001|120790.77259400001| 46.13999999999999|
|2019-01-02|        

In [7]:
df.rdd.getNumPartitions()

2

In [4]:
df.write.bucketBy(10, "date").saveAsTable("daily_b2")

In [5]:
spark.table("daily_b2").rdd.getNumPartitions()

10

Notice that the execution plan does leverage the bucketed column for optimization
(see `SelectedBucketsCount: 1 out of 10` at the end of the `FileScan` node).

In [5]:
spark.sql(
    """
    select 
        * 
    from 
        daily_b
    where
        date = "2019-01-11"
    """
).explain()

== Physical Plan ==
*(1) Project [date#53, x#54, y#55, z#56]
+- *(1) Filter (isnotnull(date#53) AND (date#53 = 2019-01-11))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.daily_b[date#53,x#54,y#55,z#56] Batched: true, DataFilters: [isnotnull(date#53), (date#53 = 2019-01-11)], Format: Parquet, Location: InMemoryFileIndex[file:/opt/spark-3.0.0-bin-hadoop3.2/warehouse/daily_b], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,2019-01-11)], ReadSchema: struct<date:string,x:string,y:string,z:string>, SelectedBucketsCount: 1 out of 10
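
To see why an equality filter on the bucketed column only needs 1 bucket out of 10, here is a toy model of bucket pruning in plain Python. It is a sketch under stated assumptions: `zlib.crc32` is a stand-in for Spark's own Murmur3-based bucket hash, and `write_buckets`, `pruned_lookup` and `full_lookup` are hypothetical helpers, not Spark APIs.

```python
import zlib

NUM_BUCKETS = 10  # same bucket count as in bucketBy(10, "date")


def bucket_id(value, num_buckets=NUM_BUCKETS):
    # Stand-in hash for illustration; Spark's real bucket hash differs.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def write_buckets(rows, key):
    """Group rows into a fixed number of buckets by hashing the key column."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_id(row[key])].append(row)
    return buckets


def pruned_lookup(buckets, key, value):
    """Scan only the single bucket that can contain `value`."""
    candidate = buckets[bucket_id(value)]
    return [row for row in candidate if row[key] == value]


def full_lookup(buckets, key, value):
    """Scan every bucket, as a reader without bucket pruning would."""
    return [row for rows in buckets.values() for row in rows if row[key] == value]


# Made-up rows keyed by date.
rows = [{"date": f"2019-01-{day:02d}", "x": day} for day in range(1, 12)]
buckets = write_buckets(rows, "date")

# Both lookups return the same rows, but the pruned one reads 1 bucket instead of 10.
print(pruned_lookup(buckets, "date", "2019-01-11") == full_lookup(buckets, "date", "2019-01-11"))
```

Because the bucket of a row is fully determined by the hash of its key, all rows matching `date = "2019-01-11"` must sit in one bucket, so the other 9 can be skipped.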




### Spark 2.3

In [2]:
import findspark
findspark.init("/opt/spark-2.3.4-bin-hadoop2.7/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark23 = SparkSession.builder.appName("PySpark_Union") \
    .enableHiveSupport().getOrCreate()

In [5]:
df = spark23.read.option("header", "true").csv("../../home/media/data/daily.csv")
df.show()

+----------+------------------+------------------+------------------+
|      date|                 x|                 y|                 z|
+----------+------------------+------------------+------------------+
|2019-01-11|               0.0|               0.0|               0.0|
|2019-01-10| 30436.96000000001|               0.0|               0.0|
|2019-01-09|          30132.28|     212952.094433|            128.52|
|2019-01-08|29883.240000000005|      352014.45016|            192.18|
|2019-01-07|          29843.17|     375139.756514|172.62000000000003|
|2019-01-06|          29520.23| 420714.7821390001|            217.98|
|2019-01-05|          29308.36|376970.94769900007|             183.3|
|2019-01-04|31114.940000000013|339321.70448899985|174.59999999999997|
|2019-01-03|          30953.24|383834.70136999997|            197.52|
|2019-01-02|          29647.83|     379943.385348|             199.2|
|2019-01-01| 9098.830000000004|     221854.328826|             88.26|
|2018-12-31|3522.929

In [6]:
df.write.bucketBy(10, "date").saveAsTable("daily_b")

In [8]:
spark23.table("daily_b").rdd.getNumPartitions()

10

Notice that the execution plan does not leverage the bucketed column for optimization
(the `FileScan` node has no `SelectedBucketsCount` entry).

In [9]:
spark23.sql(
    """
    select 
        * 
    from 
        daily_b
    where
        date = "2019-01-11"
    """
).explain()

== Physical Plan ==
*(1) Project [date#44, x#45, y#46, z#47]
+- *(1) Filter (isnotnull(date#44) && (date#44 = 2019-01-11))
   +- *(1) FileScan parquet default.daily_b[date#44,x#45,y#46,z#47] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/opt/spark-2.3.4-bin-hadoop2.7/warehouse/daily_b], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,2019-01-11)], ReadSchema: struct<date:string,x:string,y:string,z:string>
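
Rather than comparing plans by eye, one option is to search the plan text for the `SelectedBucketsCount` entry that appears in the Spark 3 plan and is absent from the Spark 2.3 plan. The helper below is a minimal sketch: `selected_buckets` is a hypothetical name, it assumes you already have the plan as a string, and the two sample strings are abbreviated versions of the plans shown in this post.

```python
import re
from typing import Optional, Tuple


def selected_buckets(plan_text: str) -> Optional[Tuple[int, int]]:
    """Return (selected, total) buckets if the plan reports bucket pruning, else None."""
    match = re.search(r"SelectedBucketsCount: (\d+) out of (\d+)", plan_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Abbreviated plan texts based on the outputs above.
spark3_plan = (
    "FileScan parquet default.daily_b[date#53,x#54,y#55,z#56] Batched: true, "
    "PushedFilters: [IsNotNull(date), EqualTo(date,2019-01-11)], "
    "SelectedBucketsCount: 1 out of 10"
)
spark23_plan = (
    "FileScan parquet default.daily_b[date#44,x#45,y#46,z#47] Batched: true, "
    "PushedFilters: [IsNotNull(date), EqualTo(date,2019-01-11)]"
)

print(selected_buckets(spark3_plan))
print(selected_buckets(spark23_plan))
```

A `None` result means the plan text does not mention bucket pruning, which is the case for the Spark 2.3 plan.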


## References

https://mungingdata.com/apache-spark/partitionby/

https://databricks.com/session_na20/bucketing-2-0-improve-spark-sql-performance-by-removing-shuffle

https://issues.apache.org/jira/browse/SPARK-19256

https://stackoverflow.com/questions/44808415/spark-parquet-partitioning-large-number-of-files

https://stackoverflow.com/questions/48585744/why-is-spark-saveastable-with-bucketby-creating-thousands-of-files