# Coalesce vs Repartition

---
#### `Load required libraries`

---

In [1]:
from pyspark.sql.functions import year, month, dayofmonth
from pyspark.sql import SparkSession
from datetime import date, timedelta
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField

---
#### `Spark configurations`

---

In [2]:
# Spark configuration
appName = "PySpark Partition Example"
master = "local[8]"

---
#### `SparkSession object`

---

In [3]:
# Create Spark session (Hive support is not enabled here)
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

---
#### `Create sample data`

---

In [4]:
# Populate sample data
start_date = date(2019, 1, 1)

data = []

# 2 million records: 1 million dates for each of the two countries
for i in range(0, 1000000):
    data.append({"Country": "CN", "Date": start_date +
                 timedelta(days=i), "Amount": 10+i})
    data.append({"Country": "AU", "Date": start_date +
                 timedelta(days=i), "Amount": 10+i})

In [5]:
# Schema for sample data
schema = StructType([StructField('Country', StringType(), nullable=False),
                     StructField('Date', DateType(), nullable=False),
                     StructField('Amount', IntegerType(), nullable=False)])

# Create dataframe
df = spark.createDataFrame(data, schema=schema)

In [6]:
# Display dataframe
df.take(5)

[Row(Country='CN', Date=datetime.date(2019, 1, 1), Amount=10),
 Row(Country='AU', Date=datetime.date(2019, 1, 1), Amount=10),
 Row(Country='CN', Date=datetime.date(2019, 1, 2), Amount=11),
 Row(Country='AU', Date=datetime.date(2019, 1, 2), Amount=11),
 Row(Country='CN', Date=datetime.date(2019, 1, 3), Amount=12)]

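Check the number of partitions. With `master = "local[8]"` the default parallelism is 8, so the DataFrame starts with 8 partitions.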

In [7]:
# Get number of partitions
df.rdd.getNumPartitions()

8

---
#### `Repartitioning Data`

---

There are two functions you can use in Spark to repartition data:

`1. coalesce`

`2. repartition`

---

#### `Repartition with Coalesce`

---

Calling `coalesce` on an RDD or DataFrame results in a `Narrow dependency`. For example, if you go from 1000 partitions to 100 partitions, there will be no shuffle. Instead, each of the 100 new partitions will claim 10 of the current partitions. However, if a larger number of partitions is requested, nothing happens, because `coalesce` does not increase the number of partitions.
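The grouping described above can be pictured with a small pure-Python sketch. It does not use Spark: `coalesce_groups` is a hypothetical helper, and the contiguous, evenly sized grouping is an assumption made for illustration (Spark's own coalescer also takes data locality into account).

```python
def coalesce_groups(num_parents, num_target):
    """Assign each parent partition index to one target partition.

    Hypothetical helper for illustration only, not Spark's implementation.
    """
    # Coalesce never increases the partition count, so cap the target.
    num_target = min(num_parents, num_target)
    groups = [[] for _ in range(num_target)]
    for parent in range(num_parents):
        # Each target partition claims a contiguous range of parents,
        # so no row has to be redistributed individually (no shuffle).
        groups[parent * num_target // num_parents].append(parent)
    return groups


groups = coalesce_groups(1000, 100)
print(len(groups), len(groups[0]))   # 100 10
print(len(coalesce_groups(8, 16)))   # 8: asking for more partitions changes nothing
print(coalesce_groups(8, 4))         # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The last two calls mirror the cells in this notebook: 8 partitions stay at 8 when 16 are requested, and collapse into 4 groups of 2 when 4 are requested.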

##### `Increase Partitions with Coalesce`

---

In [8]:
# Get original number of partitions
df.rdd.getNumPartitions()

8

Try to increase to 16 partitions.

In [9]:
# Coalesce
df = df.coalesce(16)

The number of partitions remains the same as before.

In [10]:
# Get number of partitions
df.rdd.getNumPartitions()

8

##### `Decrease Partitions with Coalesce`

---

Decrease to 4 partitions.

In [11]:
# Get original number of partitions
df.rdd.getNumPartitions()

8

In [12]:
# Coalesce
df = df.coalesce(4)

In [13]:
# Get number of partitions
df.rdd.getNumPartitions()

4

---

#### `Repartition with Repartition`

---

The other method for repartitioning is `repartition`. It is defined as follows:

###### Defining repartition
`def repartition(numPartitions, *cols)`

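Returns a new `DataFrame` partitioned by the given partitioning expressions. The resulting DataFrame is `hash partitioned`. Unlike `coalesce`, `repartition` always performs a full shuffle, which is why it can increase as well as decrease the number of partitions.

`numPartitions` can be an int that specifies the target number of partitions, or it can be a Column. If it is a Column, the data is partitioned based on that column. If no number is specified, the default number of shuffle partitions is used (`spark.sql.shuffle.partitions`, 200 by default).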
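To make the argument handling concrete, here is a pure-Python sketch of how the first argument can be interpreted. `resolve_repartition_args` and `DEFAULT_SHUFFLE_PARTITIONS` are hypothetical names introduced for illustration. Plain strings stand in for column names, and the constant assumes the default value of `spark.sql.shuffle.partitions`. This is not PySpark's source code.

```python
# Assumed default of spark.sql.shuffle.partitions.
DEFAULT_SHUFFLE_PARTITIONS = 200


def resolve_repartition_args(num_partitions, *cols):
    """Return (partition_count, partition_columns) for a repartition-style call.

    Hypothetical helper; column names are plain strings in this sketch.
    """
    if isinstance(num_partitions, bool) or not isinstance(num_partitions, (int, str)):
        raise TypeError("numPartitions should be an int or a column name")
    if isinstance(num_partitions, int):
        # An explicit count was given; any remaining arguments are columns.
        return num_partitions, list(cols)
    # The first argument is itself a column, so fall back to the default count.
    return DEFAULT_SHUFFLE_PARTITIONS, [num_partitions, *cols]


print(resolve_repartition_args(10))            # (10, [])
print(resolve_repartition_args("Country"))     # (200, ['Country'])
print(resolve_repartition_args(4, "Country"))  # (4, ['Country'])
```

The three calls correspond to the three ways of calling `repartition`: a count only, columns only, or a count together with columns.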

##### `Increase Partitions with Repartition`

---

In [14]:
# Get original number of partitions
df.rdd.getNumPartitions()

4

In [15]:
# Repartition
df = df.repartition(10)

In [16]:
# Get number of partitions
df.rdd.getNumPartitions()

10

##### `Decrease Partitions with Repartition`

---

In [17]:
# Get original number of partitions
df.rdd.getNumPartitions()

10

In [18]:
# Repartition
df = df.repartition(8)

In [19]:
# Get number of partitions
df.rdd.getNumPartitions()

8

##### `Repartition according to Column value`

---

We can also repartition by columns.

For example, let’s run the following code to repartition the data by column `Country`.

In [20]:
# Repartition
df = df.repartition("Country")

In [21]:
# Get number of partitions
df.rdd.getNumPartitions()

200

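The code above creates 200 partitions, because no partition count was given and `spark.sql.shuffle.partitions` defaults to 200. (On newer Spark versions with adaptive query execution enabled, the reported number may be lower.) Only two partitions contain data, because every row with the same `Country` value hashes to the same partition:
- one partition stores the rows for country CN
- a second partition stores the rows for country AU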
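Why only two partitions hold data can be illustrated without Spark. In the sketch below, `partition_id` is a hypothetical helper and `zlib.crc32` is only a stand-in for a deterministic hash. Spark uses its own hash function, so the actual partition ids will differ from the ones this code produces.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # same count as in the notebook output above


def partition_id(key, num_partitions):
    """Map a key to a partition: hash the key, then take it modulo the count.

    zlib.crc32 is a stand-in hash chosen for this sketch, not Spark's hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# 1000 rows per country, mimicking the two-valued Country column.
countries = ["CN", "AU"] * 1000
rows_per_partition = Counter(partition_id(c, NUM_PARTITIONS) for c in countries)

non_empty = len(rows_per_partition)
print("non-empty partitions:", non_empty)
print("empty partitions:", NUM_PARTITIONS - non_empty)
```

Because equal keys always map to the same partition, the number of non-empty partitions can never exceed the number of distinct key values. A column with only two values therefore leaves at least 198 of the 200 partitions empty.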

`Check data per partition`

* [spark_partition_id](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.spark_partition_id)

In [22]:
# Partition id
from pyspark.sql.functions import spark_partition_id

In [23]:
# Partition id for each record
df.select("*", spark_partition_id())

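+-------+----------+------+--------------------+
|Country|      Date|Amount|SPARK_PARTITION_ID()|
+-------+----------+------+--------------------+
|     CN|2019-01-01|    10|                  28|
|     CN|2019-01-02|    11|                  28|
|     CN|2019-01-03|    12|                  28|
|     CN|2019-01-04|    13|                  28|
|     CN|2019-01-05|    14|                  28|
|     CN|2019-01-06|    15|                  28|
|     CN|2019-01-07|    16|                  28|
|     CN|2019-01-08|    17|                  28|
|     CN|2019-01-09|    18|                  28|
|     CN|2019-01-10|    19|                  28|
+-------+----------+------+--------------------+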


In [24]:
# Count record per partition
df.groupBy(spark_partition_id()).count().show()

+--------------------+-------+
|SPARK_PARTITION_ID()|  count|
+--------------------+-------+
|                  28|1000000|
|                  64|1000000|
+--------------------+-------+

