- Author: Benjamin Du
- Date: 2021-12-11 17:21:56
- Modified: 2021-12-11 17:21:56
- Title: Control Number of Partitions of a DataFrame in Spark
- Slug: control-number-of-partitions-of-a-dataframe-in-spark
- Category: Computer Science
- Tags: Computer Science, programming, Spark, PySpark, big data, partition, repartition, maxPartitionBytes

**Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!**

## Tips and Traps

1. `DataFrame.repartition` distributes rows round-robin
    when you specify only a number of partitions.
    If you specify one or more columns
    (with or without a number of partitions),
    rows are assigned to partitions by the **hash code** of those columns.
    In some situations,
    there are lots of hash collisions
    even if the total number of rows is small (e.g., a few thousand),
    which means that the generated partitions might be skewed
    and cause a few long-running tasks.
    If this ever happens,
    it is suggested that you manually add a column
    which helps the hashing algorithm.
    Notice that an existing integer column in the DataFrame
    is not necessarily a good column to repartition by,
    especially when those integers are big (e.g., u64),
    as the hash codes of those integers can easily collide.
    It is best to add a column of random numbers
    or a column of manually curated partition indexes
    and ask Spark to repartition based on that column.
    
2. By default,
    Spark automatically merges several small files into one partition
    when loading an HDFS table into a DataFrame.
    This behavior is controlled by the parameter `spark.sql.files.maxPartitionBytes`.
    The default value for this option is 128 MB,
    which means that Spark keeps packing small files into one partition of a DataFrame
    until adding another file would make the size of the partition exceed 128 MB.
    Generally speaking,
    you want to keep the default value for this setting
    as it usually performs well for large tables.
    However,
    if your Spark application deals with small data but is CPU intensive,
    it makes more sense to set a much smaller value for `spark.sql.files.maxPartitionBytes`
    so that more partitions are generated, yielding a higher level of parallelism.
    Of course,
    you can always repartition a DataFrame manually.