- Author: Ben Du
- Date: 2020-06-24 13:25:39
- Title: Coalesce and Repartition in Spark DataFrame
- Slug: spark-dataframe-coalesce-repartition
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, repartition, coalesce
- Modified: 2022-01-18 14:34:12


## References

https://stackoverflow.com/questions/42171499/get-current-number-of-partitions-of-a-dataframe

## coalesce vs repartition

https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1

In [2]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session with 2 worker threads.
val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Coalesce and Repartition Example")
    // Placeholder option; it has no effect here.
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

import spark.implicits._


In [3]:
val df = spark.read.json("../../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+




## Get Number of Partitions

In [4]:
df.rdd.getNumPartitions

1
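
The partition count alone does not show how rows are spread across partitions. As a minimal sketch, the built-in `spark_partition_id` function can tag each row with the partition that holds it. The DataFrame `nums` and the app name below are hypothetical stand-ins (built with `spark.range`), not the `people.json` data used above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Partition Inspection Sketch")
    .getOrCreate()

// Hypothetical example data: 8 rows explicitly split into 4 partitions.
val nums = spark.range(0, 8, 1, 4).toDF("id")

// Count the rows held by each partition.
val rowsPerPartition = nums
    .groupBy(spark_partition_id().as("partition"))
    .count()
    .orderBy("partition")

rowsPerPartition.show()
```

A heavily skewed result here is often a hint that the data should be repartitioned before an expensive stage.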

## Repartition

In [6]:
val df2 = df.repartition(4)

[age: bigint, name: string]

In [7]:
df2.rdd.getNumPartitions

4
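
## Coalesce

`repartition` performs a full shuffle and can raise or lower the partition count, whereas `coalesce` merges existing partitions without a full shuffle and can only lower it. The sketch below illustrates the difference. The DataFrame `nums` and the app name are hypothetical stand-ins (built with `spark.range`) rather than the `people.json` data used above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Coalesce Sketch")
    .getOrCreate()

// Hypothetical example data, shuffled into 4 partitions.
val nums = spark.range(0, 100).toDF("id").repartition(4)
println(nums.rdd.getNumPartitions)      // 4

// coalesce merges partitions without a full shuffle.
val fewer = nums.coalesce(2)
println(fewer.rdd.getNumPartitions)     // 2

// Asking coalesce for more partitions than exist does not add any.
val notMore = nums.coalesce(8)
println(notMore.rdd.getNumPartitions)   // stays at 4

// repartition, by contrast, can increase the count.
val more = nums.repartition(8)
println(more.rdd.getNumPartitions)      // 8
```

Because `coalesce` avoids a full shuffle, it is usually the cheaper choice when the goal is only to reduce the partition count, for example before writing a small number of output files.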

## References

- [Control Number of Partitions of a DataFrame in Spark](http://www.legendu.net/en/blog/control-number-of-partitions-of-a-dataframe-in-spark/)

- [Partition and Bucketing in Spark](http://www.legendu.net/misc/blog/partition-bucketing-in-spark/)

- https://stackoverflow.com/questions/30995699/how-to-define-partitioning-of-dataframe

- https://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where

- https://issues.apache.org/jira/browse/SPARK-22614

- https://mungingdata.com/apache-spark/partitionby/