# 3 Partitions of dataframe and dataset
Spark is a distributed computing framework, so data structures such as RDD, DataFrame and Dataset are split into
chunks that can be processed in parallel. These chunks are called partitions.

When we create an RDD, DataFrame or Dataset, Spark assigns it a default number of partitions. For data created in
memory (e.g. with `parallelize`), this default typically equals the total number of cores available to the
application (4 with `local[4]`). To improve performance, you may want to change the number of partitions. It can be
modified by two methods:
- repartition: increases or decreases the number of partitions, and always performs a full shuffle.
- coalesce: only reduces the number of partitions.

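Note that coalesce is more efficient than repartition when reducing the number of partitions: it merges existing
partitions instead of performing a full shuffle, which means less data movement across the cluster. As a result,
if coalesce is sufficient, prefer it over repartition.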
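
To make the difference concrete, here is a toy model in plain Python. It is only a sketch and not how Spark is implemented: partitions are plain lists, and the helper names `toy_coalesce`, `toy_repartition` and `fan_out` are hypothetical. The model merges whole neighbouring partitions for coalesce and deals every element round-robin for repartition, then counts how many new partitions each old partition feeds.

```python
# Toy model only: partitions are plain lists, not Spark objects.

def toy_coalesce(partitions, n):
    """Merge neighbouring partitions into at most n groups (whole partitions stay together)."""
    k = len(partitions)
    n = min(n, k)  # like coalesce without shuffle, this model never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // k].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Model of a full shuffle: deal every element round-robin into n new partitions."""
    result = [[] for _ in range(n)]
    flat = [x for part in partitions for x in part]
    for i, x in enumerate(flat):
        result[i % n].append(x)
    return result


def fan_out(old, new):
    """For each old partition, count how many new partitions receive its elements."""
    target = {x: j for j, part in enumerate(new) for x in part}
    return [len({target[x] for x in part}) for part in old]


old = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
coalesced = toy_coalesce(old, 2)
repartitioned = toy_repartition(old, 2)

print(coalesced)                    # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
print(fan_out(old, coalesced))      # [1, 1, 1, 1]
print(fan_out(old, repartitioned))  # [2, 2, 2, 2]
```

In this model every old partition ends up in exactly one new partition after coalesce, whereas repartition scatters each old partition over several new ones. That scattering is what makes a full shuffle expensive on a real cluster.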


In [1]:
from pyspark.sql import SparkSession
import os

local=True
if local:
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.executor.memory", "2g")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory","2g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

# enable eager evaluation so that data frames are rendered as tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

21/08/01 10:21:07 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.184.141 instead (on interface ens33)
21/08/01 10:21:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/01 10:21:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 3.1 RDD repartition and coalesce

We can check the number of partitions of an RDD with the `getNumPartitions()` method.

In [2]:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
# The default is 4 partitions because the session runs with 4 cores (i.e. local[4])
print("Default rdd partition: {}".format(rdd.getNumPartitions()))

# If you don't want the default, you can pass the number of partitions explicitly
rdd_fix_partition = spark.sparkContext.parallelize(data, 8)
print("rdd_fix_partition partition: {}".format(rdd_fix_partition.getNumPartitions()))

# The same pattern also works for textFile(); it is not shown here

# When you save data to disk, each partition is written as one part file. The line below generates four files
# rdd.saveAsTextFile("/tmp/partition")
# You will find 4 files in /tmp/partition
# - part-00000:1,2,3
# - part-00001:4,5,6
# - part-00002:7,8,9
# - part-00003:10,11,12
# The data is not skewed, because it is split evenly across the 4 partitions.

# We can increase the number of partitions to 12
rdd_repart = rdd.repartition(12)
print("After repartition rdd has partition: {}".format(rdd_repart.getNumPartitions()))

# We can reduce the number of partitions to 2
rdd_coalesce = rdd.coalesce(2)
print("After coalesce rdd has partition: {}".format(rdd_coalesce.getNumPartitions()))

Default rdd partition: 4
rdd_fix_partition partition: 8
After repartition rdd has partition: 12
After coalesce rdd has partition: 2
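
The part files listed in the comments of the cell above follow from simple slicing arithmetic. The sketch below is a plain-Python approximation, not PySpark's actual code path, and the helper `slice_evenly` is hypothetical. It cuts a list into contiguous ranges and reports the partition sizes, which is a quick way to reason about skew before running a job.

```python
def slice_evenly(data, num_slices):
    """Cut data into num_slices contiguous chunks whose sizes differ by at most one."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


data = list(range(1, 13))

parts = slice_evenly(data, 4)
sizes = [len(p) for p in parts]
print(parts)                   # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(max(sizes) - min(sizes)) # 0, i.e. no skew in this toy model

# 12 elements cannot be split evenly into 5 chunks, so the sizes differ by one
uneven_sizes = [len(p) for p in slice_evenly(data, 5)]
print(uneven_sizes)            # [2, 2, 3, 2, 3]
```

On a real RDD, the contents of each partition can be inspected with `rdd.glom().collect()`.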


## 3.2 DataFrame repartition

We can create a data frame from a Spark session or from an RDD. Note that none of the methods below lets us specify
the number of partitions, so Spark chooses a default. For data created in memory, this is the number of cores
available to the session. In our case, it's 4.
- SparkSession: `createDataFrame(rdd)`, `createDataFrame(dataList)`, `createDataFrame(rowData, columns)`, `createDataFrame(dataList, schema)`, `read`
- RDD: `toDF()`, `toDF(*cols)`


In [3]:
df = spark.range(0, 20)
df.show()
# Note: a PySpark data frame has no method to get the number of partitions, so we go through its underlying RDD.
# A data frame is built on top of an RDD, so it has the same number of partitions as that RDD.
print("Default data frame partition: {}".format(df.rdd.getNumPartitions()))
df_repart = df.repartition(8)
print("After repartition(8) data frame partition: {}".format(df_repart.rdd.getNumPartitions()))
df_coalesce = df.coalesce(2)
print("After coalesce(2) data frame partition: {}".format(df_coalesce.rdd.getNumPartitions()))

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+

Default data frame partition: 4
After repartition(8) data frame partition: 8
After coalesce(2) data frame partition: 2


### 3.2.1 DataFrame auto-repartition after transformations that trigger a shuffle

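If a transformation (e.g. groupBy, join, distinct) triggers a shuffle, rows are transferred between executors and
possibly between machines. This leads to an automatic repartition of your data frame. The default number of
partitions after a shuffle is **200**. You can change this default with the **spark.sql.shuffle.partitions**
configuration.

Note that you will not always see 200 partitions after such a transformation. The setting applies in local mode as
well as on a cluster, but Spark skips the shuffle when the data is already partitioned on the grouping key, and with
adaptive query execution (enabled by default since Spark 3.2) it can merge small shuffle partitions. In the example
below, the result of groupBy still has 4 partitions, most likely because `spark.range` already produces data
partitioned by `id`.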

In PySpark, you can add the following line:
``` python
spark.conf.set("spark.sql.shuffle.partitions",10)
```

In Scala, you can add the following lines:
``` scala
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, 2)
```
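
As a rough mental model of the shuffle described above, the following plain-Python sketch places each row in the partition given by the hash of its key modulo the number of shuffle partitions. It is an illustration under assumptions: `toy_shuffle` and `NUM_SHUFFLE_PARTITIONS` are hypothetical names, and Spark uses its own hash function, so the actual placement of rows differs.

```python
NUM_SHUFFLE_PARTITIONS = 10  # stands in for spark.sql.shuffle.partitions


def toy_shuffle(rows, key, num_partitions):
    """Place each row in partition hash(row[key]) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


# 20 rows but only 5 distinct keys (0..4)
rows = [{"id": i % 5, "value": i} for i in range(20)]
shuffled = toy_shuffle(rows, "id", NUM_SHUFFLE_PARTITIONS)

# The partition count is fixed by the setting, not by the data
print(len(shuffled))                  # 10

# All rows sharing a key land in the same partition, so a groupBy can be computed locally
for part in shuffled:
    print(sorted({row["id"] for row in part}))

empty = sum(1 for part in shuffled if not part)
print(empty)                          # 5 partitions receive no rows at all
```

With only five distinct keys, half of the ten partitions stay empty. This is one reason why lowering the shuffle partition setting can help on small data, while large data may need a higher value.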

In [4]:
# Note: if a transformation does not trigger a shuffle, the number of partitions of the resulting data frame
# does not change.
df_no_shuffle = df.withColumn("plus", df.id + 2)
df_no_shuffle.show()
print("After withColumn data frame partition: {}".format(df_no_shuffle.rdd.getNumPartitions()))


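# groupBy normally triggers a shuffle, after which the data frame has spark.sql.shuffle.partitions partitions
# (200 by default). In this run the result still reports 4 partitions, see the output below.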
df_with_shuffle = df.groupBy("id").count()
df_with_shuffle.show()
print("After GroupBy data frame partition: {}".format(df_with_shuffle.rdd.getNumPartitions()))

+---+----+
| id|plus|
+---+----+
|  0|   2|
|  1|   3|
|  2|   4|
|  3|   5|
|  4|   6|
|  5|   7|
|  6|   8|
|  7|   9|
|  8|  10|
|  9|  11|
| 10|  12|
| 11|  13|
| 12|  14|
| 13|  15|
| 14|  16|
| 15|  17|
| 16|  18|
| 17|  19|
| 18|  20|
| 19|  21|
+---+----+

After withColumn data frame partition: 4
+---+-----+
| id|count|
+---+-----+
|  0|    1|
|  1|    1|
|  2|    1|
|  3|    1|
|  4|    1|
|  5|    1|
|  6|    1|
|  7|    1|
|  8|    1|
|  9|    1|
| 10|    1|
| 11|    1|
| 12|    1|
| 13|    1|
| 14|    1|
| 15|    1|
| 16|    1|
| 17|    1|
| 18|    1|
| 19|    1|
+---+-----+

After GroupBy data frame partition: 4
